mirror of
https://github.com/NVIDIA/dgx-spark-playbooks.git
synced 2026-04-22 18:13:52 +00:00
| .. | ||
| ui_image | ||
| ui_video | ||
| Dockerfile | ||
| launch.sh | ||
| README.md | ||
VLM Fine-tuning Recipes
This repository contains comprehensive fine-tuning recipes for Vision-Language Models (VLMs), supporting both image and video understanding tasks with modern models and training techniques.
🎯 Available Recipes
📸 Image VLM Fine-tuning (ui_image/)
- Model: Qwen2.5-VL-7B-Instruct
- Task: Wildfire detection from satellite imagery
- Training Method: GRPO (Generalized Reward Preference Optimization) and LoRA (Low-rank Adaptation)
🎥 Video VLM Fine-tuning (ui_video/)
- Model: InternVL3-8B
- Task: Dangerous driving detection and structured metadata generation
- Training Method: Supervised Fine-tuning on Multimodal Instructions
🚀 Quick Start
1. Build the Docker Container
# Build the VLM fine-tuning container
docker build --build-arg HF_TOKEN=$HF_TOKEN -t vlm_demo .
2. Launch the Container
# Enter the correct directory for building the image
cd vlm-finetuning/assets
# Run the container with GPU support
sh launch.sh
# Enter the mounted directory within the container
cd /vlm_finetuning
Note
: The same Docker container and launch commands work for both image and video VLM recipes. The container includes all necessary dependencies including FFmpeg, Decord, and optimized libraries for both workflows.
📚 Detailed Instructions
Each recipe includes comprehensive documentation:
- Image VLM README: Complete guide for wildfire detection fine-tuning with Qwen2.5-VL, including dataset setup, GRPO training configuration, and interactive inference
- Video VLM README: Full walkthrough for dangerous driving detection with InternVL3, covering video data preparation, training notebooks, and structured output generation