chore: Regenerate all playbooks

GitLab CI 2025-10-08 18:42:07 +00:00
parent 4a14a6d298
commit 54920e66a0
2 changed files with 103 additions and 19 deletions

File 1 of 2

@@ -40,7 +40,7 @@ The setup includes:
## Time & risk
**Duration**:
- - 15 minutes for initial setup model download time
+ - 30-45 minutes for initial setup and model download time
- 1-2 hours for dreambooth LoRA training
**Risks**:
@@ -85,9 +85,10 @@ If you do not have a `HF_TOKEN` already, follow the instructions [here](https://
```bash
export HF_TOKEN=<YOUR_HF_TOKEN>
- cd flux-finetuning/assets
+ cd dgx-spark-playbooks/nvidia/flux-finetuning/assets
sh download.sh
```
The download script can take about 30-45 minutes to complete based on your internet speed.
If you already have fine-tuned LoRAs, place them inside `models/loras`. If you do not have any yet, proceed to the `Step 6. Training` section for more details.
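For example, copying an existing LoRA checkpoint into place could look like the following (the source path and filename are placeholders):

```bash
## Placeholder path: substitute your own LoRA checkpoint
cp /path/to/my_concepts_lora.safetensors models/loras/
```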
@@ -120,27 +121,29 @@ sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
## Step 5. Dataset preparation
- Let's prepare our dataset to perform Dreambooth LoRA fine-tuning on the FLUX.1-dev 12B model. However, if you wish to continue with the provided dataset of Toy Jensen and DGX Spark, feel free to skip to the Training section below. This dataset is a collection of public assets accessible via Google Images.
+ Let's prepare our dataset to perform Dreambooth LoRA fine-tuning on the FLUX.1-dev 12B model.
- You will need to prepare a dataset of all the concepts you would like to generate and about 5-10 images for each concept. For this example, we would like to generate images with 2 concepts.
+ For this playbook, we have already prepared a dataset of 2 concepts - Toy Jensen and DGX Spark. This dataset is a collection of public assets accessible via Google Images. If you wish to generate images with these concepts, you do not need to modify the `data.toml` file.
**TJToy Concept**
- **Trigger phrase**: `tjtoy toy`
- - **Training images**: 6 high-quality images of custom toy figures
+ - **Training images**: 6 high-quality images of Toy Jensen figures available in the public domain
- **Use case**: Generate images featuring the specific toy character in various scenes
**SparkGPU Concept**
- **Trigger phrase**: `sparkgpu gpu`
- - **Training images**: 7 images of custom GPU hardware
+ - **Training images**: 7 images of DGX Spark GPU available in the public domain
- **Use case**: Generate images featuring the specific GPU design in different contexts
If you wish to generate images with custom concepts, you would need to prepare a dataset of all the concepts you would like to generate and about 5-10 images for each concept.
Create a folder for each concept with its corresponding name and place it inside the `flux_data` directory. In our case, we have used `sparkgpu` and `tjtoy` as our concepts, and placed a few images inside each of them.
- Now, let's modify the `flux_data/data.toml` file to reflect the concepts chosen. Ensure that you update/create entries for each of your concept by modifying the `image_dir` and `class_tokens` fields under `[[datasets.subsets]]`. For better performance in fine-tuning, it is good practice to append a class token to your concept name (like `toy` or `gpu`).
+ Now, let's modify the `flux_data/data.toml` file to reflect the concepts chosen. Ensure that you update/create entries for each of your concepts by modifying the `image_dir` and `class_tokens` fields under `[[datasets.subsets]]`. For better performance in fine-tuning, it is good practice to append a class token to your concept name (like `toy` or `gpu`).
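To make the folder layout and `data.toml` edits above concrete, here is a minimal sketch. The image source paths are placeholders, and the exact `data.toml` schema should follow the file shipped with the playbook; only the `image_dir` and `class_tokens` fields shown here come from the instructions above.

```bash
## Sketch only: one folder per concept; image source paths are placeholders
mkdir -p flux_data/tjtoy flux_data/sparkgpu
cp /path/to/toy_jensen_images/*.jpg flux_data/tjtoy/
cp /path/to/dgx_spark_images/*.jpg flux_data/sparkgpu/

## The corresponding [[datasets.subsets]] entries in flux_data/data.toml would then
## point at those folders, roughly like this (other fields follow the provided file):
cat <<'EOF'
[[datasets.subsets]]
image_dir = "flux_data/tjtoy"
class_tokens = "tjtoy toy"

[[datasets.subsets]]
image_dir = "flux_data/sparkgpu"
class_tokens = "sparkgpu gpu"
EOF
```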
## Step 6. Training
- Launch training by executing the follow command. The training script is set up to use a default configuration that can generate reasonable images for your dataset, in about ~90 mins of training. This train command will automatically store checkpoints in the `models/loras/` directory.
+ Launch training by executing the following command. The training script uses a default configuration that produces images capturing your DreamBooth concepts effectively after about 90 minutes of training. The training command automatically stores checkpoints in the `models/loras/` directory.
```bash
## Build the inference docker image

File 2 of 2

@@ -82,16 +82,17 @@ sudo usermod -aG docker $USER
In a terminal, clone the repository and navigate to the VLM fine-tuning directory.
```bash
- git clone https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main dgx-spark-playbooks
+ git clone https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}/
```
## Step 3. Build the Docker container
- Build the Docker image. This will set up the environment for both image and video VLM fine-tuning:
+ Build the Docker image. This will set up the environment for both image and video VLM fine-tuning.
Please export your Hugging Face token as the `HF_TOKEN` environment variable. You may encounter warnings when building the image; these are expected and can be ignored.
```bash
## Enter the correct directory for building the image
- cd vlm-finetuning/assets
+ cd dgx-spark-playbooks/nvidia/vlm-finetuning/assets
## Build the VLM fine-tuning container
docker build --build-arg HF_TOKEN=$HF_TOKEN -t vlm_demo .
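## Note: the command for starting the container is not shown in this excerpt. A minimal,
## illustrative run command (flags, ports, and mounts are assumptions; 8501/8888 are the
## Streamlit and Jupyter ports used later in this playbook) could look like:
docker run --rm -it --gpus all \
  -p 8501:8501 -p 8888:8888 \
  -e HF_TOKEN=$HF_TOKEN \
  vlm_demo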
@@ -118,9 +119,30 @@ hf download Qwen/Qwen2.5-VL-7B-Instruct
If you already have a fine-tuned checkpoint, place it in the `saved_model/` folder. Note that your checkpoint number can be different. For a comparative analysis against the base model, skip directly to the `Finetuned Model Inference` section.
- #### 5.2. Download the wildfire dataset from Kaggle and place it in the `data` directory
+ #### 5.2. Download the wildfire dataset
The wildfire dataset can be found here: https://www.kaggle.com/datasets/abdelghaniaaba/wildfire-prediction-dataset.
The project uses a **Wildfire Detection Dataset** with satellite imagery for training the model to identify wildfire-affected regions. The dataset includes:
- Satellite and aerial imagery from wildfire-affected areas
- Binary classification: wildfire vs no wildfire
```bash
mkdir -p ui_image/data
cd ui_image/data
```
For this fine-tuning playbook, we will use the [Wildfire Prediction Dataset](https://www.kaggle.com/datasets/abdelghaniaaba/wildfire-prediction-dataset) from Kaggle. Visit the Kaggle dataset page [here](https://www.kaggle.com/datasets/abdelghaniaaba/wildfire-prediction-dataset) and click the download button. Select the `cURL` option in the `Download Via` dropdown and copy the curl command.
> **Note**: You will need to be logged into Kaggle and may need to accept the dataset terms before the download link works.
Run the following commands in your container:
```bash
## Paste and run the curl command from Kaggle here, and then continue to unzip the dataset
unzip -qq wildfire-prediction-dataset.zip
rm wildfire-prediction-dataset.zip
cd ..
```
#### 5.3. Base model inference
@@ -134,26 +156,61 @@ Access the streamlit demo at http://localhost:8501/.
When you access the streamlit demo for the first time, the backend triggers vLLM servers to spin up for the base model. You will see a spinner on the demo site as vLLM is being brought up for optimized inference. This step can take up to 15 mins.
Since we are currently focused on running inference with the base model, let's scroll down to the `Image Inference` section of the UI. Here, you should see a sample pre-loaded satellite image of a potentially wildfire-affected region.
Enter your prompt in the chat box and hit `Generate`. Your prompt will first be sent to the base model, and you should see the generated response in the left chat box. If you did not provide a fine-tuned model, you will not see any generations in the right chat box. You can use the following prompt to quickly test inference:
`Identify if this region has been affected by a wildfire`
As you can see, the base model is incapable of providing the right response for this domain-specific task. Let's try to improve the model's accuracy by performing GRPO fine-tuning.
#### 5.4. GRPO fine-tuning
We will perform GRPO fine-tuning to add reasoning capabilities to our base model and improve its understanding of the underlying domain. Since you have already spun up the streamlit demo, scroll to the `GRPO Training` section.
Configure the fine-tuning method and LoRA parameters based on the following options.
- `Finetuning Method`: Choose from Full Finetuning or LoRA
- `LoRA Parameters`: Adjustable rank (8-64) and alpha (8-64)
You can additionally choose which layers of the VLM you want to fine-tune. For the best performance, ensure that all options are toggled on. Note that this will also increase the training time.
In this section, we can also select the training parameters relevant to our run.
- `Steps`: 1-1000
- `Batch Size`: 1, 2, 4, 8, or 16
- `Learning Rate`: 1e-6 to 1e-2
- `Optimizer`: AdamW or Adafactor
For a GRPO setup, we also have the flexibility to choose the rewards assigned to the model based on certain criteria:
- `Format Reward`: 2.0 (reward for proper reasoning format)
- `Correctness Reward`: 5.0 (reward for correct answers)
- `Number of Generations`: 4 (for preference optimization)
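To picture how the two rewards above combine, a generation that follows the reasoning format and answers correctly would earn 2.0 + 5.0 = 7.0, while a well-formatted but incorrect answer earns only 2.0. A toy calculation (illustrative only; the actual reward function lives in the training code):

```bash
## Toy illustration of the reward arithmetic above (not the playbook's reward code)
format_ok=1   # 1 if the generation follows the expected reasoning format
answer_ok=0   # 1 if the final answer is correct
awk -v f="$format_ok" -v a="$answer_ok" 'BEGIN { print "total reward:", f*2.0 + a*5.0 }'
```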
After configuring all the parameters, hit `Start Finetuning` to begin the training process. You will need to wait about 15 minutes for the model to load and for metadata to start appearing in the UI. As training progresses, information such as the loss, step, and GRPO rewards will be recorded in a live table.
The default loaded configuration should give you reasonable accuracy, taking 100 steps of training over a period of up to 2 hours. We achieved our best accuracy with around 1000 steps of training, taking close to 16 hours.
Once the training process has reached the desired number of training steps, the script automatically merges the LoRA weights into the base model; this merge can take about 5 minutes.
If you wish to stop training early, hit the `Stop Finetuning` button. Use this button only to interrupt training; it does not guarantee that checkpoints will be properly stored or merged with the LoRA adapter layers.
Once you stop training, the UI will automatically bring up the vLLM servers for the base model and the newly fine-tuned model.
#### 5.5. Fine-tuned model inference
Now we are ready to perform a comparative analysis between the base model and the fine-tuned model.
If you haven't spun up the streamlit demo already, execute the following command. If you have just stopped training and are still within the live UI, skip this step.
```bash
streamlit run Image_VLM.py
```
Regardless of whether you just spun up the demo or just stopped training, please wait about 15 minutes for the vLLM servers to be brought up.
- Scroll down to the `Image Inference` section and enter your prompt in the provided chat box.
- Upon clicking `Generate` your prompt will be first sent to the base model and then to the fine-tuned model. You can use the following prompt to quickly test inference:
+ Scroll down to the `Image Inference` section and enter your prompt in the provided chat box. Upon clicking `Generate`, your prompt will first be sent to the base model and then to the fine-tuned model. You can use the following prompt to quickly test inference:
`Identify if this region has been affected by a wildfire`
@@ -161,6 +218,12 @@ If you trained your model sufficiently, you should see that the fine-tuned model
## Step 6. [Option B] For video VLM fine-tuning (Driver Behaviour Analysis)
Within the same container, navigate to the `ui_video` directory.
```bash
cd /vlm_finetuning/ui_video
```
#### 6.1. Prepare your video dataset
Structure your dataset as follows. Ensure that `metadata.jsonl` contains rows of structured JSON data about each video.
@@ -175,7 +238,7 @@ dataset/
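The exact fields in `metadata.jsonl` are defined by the playbook's notebook and are not shown in this excerpt. Purely to illustrate the shape (one JSON object per line, one line per video; the field names below are hypothetical):

```bash
## Hypothetical row: match the field names to the schema expected by the notebook
cat <<'EOF'
{"video": "videos/clip_0001.mp4", "label": "unsafe", "events": ["hard braking", "lane drift"]}
EOF
```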
#### 6.2. Model download
- > **Note**: These instructions assume you are already inside the Docker container. For container setup, refer to the main project README at `vlm-finetuning/assets/README.md`.
+ > **Note**: These instructions assume you are already inside the Docker container. For container setup, refer to the `Build the Docker container` section above.
```bash
hf download OpenGVLab/InternVL3-8B
@@ -186,7 +249,7 @@ hf download OpenGVLab/InternVL3-8B
Before going ahead to fine-tune our video VLM for this task, let's see how the base InternVL3-8B does.
```bash
- ## cd into vlm_finetuning/assets/ui_video if you haven't already
+ ## cd into /vlm_finetuning/ui_video if you haven't already
streamlit run Video_VLM.py
```
@@ -196,6 +259,10 @@ When you access the streamlit demo for the first time, the backend triggers Hugg
First, let's select a video from our dashcam gallery. Upon clicking the green file-open icon next to a video, you should see the video render and play automatically.
Scroll down, enter your prompt in the chat box, and hit `Generate`. Your prompt will first be sent to the base model, and you should see the generated response in the left chat box. If you did not provide a fine-tuned model, you will not see any generations in the right chat box. You can use the following prompt to quickly test inference:
`Analyze the dashcam footage for unsafe driver behavior`
If you are proceeding to train a fine-tuned model, ensure that the streamlit demo UI is brought down first. You can bring it down by interrupting the terminal with a `Ctrl+C` keystroke.
> **Note**: To clear out any extra occupied memory from your system, execute the following command outside the container after stopping the Streamlit server.
@@ -207,7 +274,7 @@ sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```bash
## Enter the correct directory
- cd /vlm-finetuning/ui_video/train
+ cd train
## Start Jupyter Lab
jupyter notebook video_vlm.ipynb
@@ -217,6 +284,17 @@ Access Jupyter at `http://localhost:8888`. Ensure that you set the path to your
```python
dataset_path = "/path/to/your/dataset"
```
Here are some of the key training parameters that are configurable. Please note that for reasonable quality, you will need to train your video VLM for at least 24 hours, given the complexity of processing spatio-temporal video sequences.
- **Model**: InternVL3-8B
- **Video Frames**: 12 to 16 frames per video
- **Sampling Mode**: Uniform temporal sampling
- **LoRA Configuration**: Efficient parameter updates for large-scale fine-tuning
- **Hyperparameters**: Exhaustive suite of hyperparameters to tune for video VLM fine-tuning
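The uniform temporal sampling mode listed above simply spaces the selected frames evenly across a clip. As a quick illustration (a 300-frame clip and 16 frames are assumptions, not the notebook's actual implementation):

```bash
## Illustrative only: indices of 16 uniformly spaced frames in a 300-frame clip
total=300; frames=16
for i in $(seq 0 $((frames - 1))); do
  printf '%d ' $(( i * (total - 1) / (frames - 1) ))
done
echo   # prints: 0 19 39 ... 299
```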
You can monitor and evaluate the training progress and metrics, as they will be continuously shown in the notebook.
After training, ensure that you shut down the Jupyter kernel in the notebook and stop the Jupyter server in the terminal with a `Ctrl+C` keystroke.
> **Note**: To clear out any extra occupied memory from your system, execute the following command outside the container after stopping the Jupyter server.
@@ -230,11 +308,14 @@ Now we are ready to perform a comparative analysis between the base model and th
If you haven't spun up the streamlit demo already, execute the following command. If you have just stopped training and are still within the live UI, skip to the next step.
```bash
## cd back to /vlm_finetuning/ui_video if you haven't already
streamlit run Video_VLM.py
```
Access the streamlit demo at http://localhost:8501/.
If you trained your model sufficiently, you should see that the fine-tuned model is able to identify the salient events from the video and generate a structured output.
Since the model's output adheres to the schema we trained, we can directly export the model's prediction into a database for video analytics.
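As a rough sketch of that export (the table layout, field names, and the use of SQLite are illustrative assumptions, not part of the playbook), each structured prediction could be appended to a table for downstream analytics:

```bash
## Illustrative only: store each structured prediction alongside its source video
sqlite3 video_analytics.db "CREATE TABLE IF NOT EXISTS predictions (video TEXT, output TEXT);"
sqlite3 video_analytics.db \
  "INSERT INTO predictions VALUES ('clip_0001.mp4', '{\"unsafe_behavior\": true, \"events\": [\"hard braking\"]}');"
```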
Feel free to play around with additional videos available in the gallery.