This repository contains comprehensive fine-tuning recipes for Vision-Language Models (VLMs), supporting both **image** and **video** understanding tasks with modern models and training techniques.
## 🎯 Available Recipes
### 📸 Image VLM Fine-tuning (`ui_image/`)
- **Model**: Qwen2.5-VL-7B-Instruct
- **Task**: Wildfire detection from satellite imagery
### 🎥 Video VLM Fine-tuning (`ui_video/`)
- **Model**: InternVL3
- **Task**: Dangerous driving detection from video
> **Note**: The same Docker container and launch commands work for both the image and video recipes. The container bundles all required dependencies, including FFmpeg and Decord for video decoding.
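As a concrete illustration, launching the shared container might look like the sketch below. The image tag `vlm-recipes:latest`, the shared-memory size, and the mount paths are assumptions for illustration, not values published by this repository; substitute the ones given in the recipe READMEs.

```shell
# Hypothetical image tag and paths — replace with the values from the recipe docs.
docker run --gpus all -it --rm \
    --shm-size=8g \
    -v "$PWD":/workspace \
    -w /workspace \
    vlm-recipes:latest bash
```

The same invocation serves both recipes; only the working directory (`ui_image/` or `ui_video/`) changes once inside the container.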
## 📚 Detailed Instructions
Each recipe ships with its own step-by-step guide:
- **[Image VLM README](ui_image/README.md)**: Complete guide for wildfire detection fine-tuning with Qwen2.5-VL, including dataset setup, GRPO training configuration, and interactive inference
- **[Video VLM README](ui_video/README.md)**: Full walkthrough for dangerous driving detection with InternVL3, covering video data preparation, training notebooks, and structured output generation