diff --git a/nvidia/flux-finetuning/README.md b/nvidia/flux-finetuning/README.md index 4f956cd..a6f4db5 100644 --- a/nvidia/flux-finetuning/README.md +++ b/nvidia/flux-finetuning/README.md @@ -1,6 +1,6 @@ # FLUX.1 Dreambooth LoRA Fine-tuning -> Fine-tune FLUX.1-dev 11B model using multi-concept Dreambooth LoRA for custom image generation +> Fine-tune FLUX.1-dev 12B model using multi-concept Dreambooth LoRA for custom image generation ## Table of Contents @@ -11,13 +11,13 @@ ## Overview -## Basic Idea +## Basic idea -This playbook demonstrates how to fine-tune the FLUX.1-dev 11B model using multi-concept Dreambooth LoRA (Low-Rank Adaptation) for custom image generation on DGX Spark. +This playbook demonstrates how to fine-tune the FLUX.1-dev 12B model using multi-concept Dreambooth LoRA (Low-Rank Adaptation) for custom image generation on DGX Spark. With 128GB of unified memory and powerful GPU acceleration, DGX Spark provides an ideal environment for training an image generation model with multiple models loaded in memory, such as the Diffusion Transformer, CLIP Text Encoder, T5 Text Encoder, and the Autoencoder. Multi-concept Dreambooth LoRA fine-tuning allows you to teach FLUX.1 new concepts, characters, and styles. The trained LoRA weights can be easily integrated into existing ComfyUI workflows, making it perfect for prototyping and experimentation. -Moreover, this playbook demonstrates how DGX Spark can not only load several models in memory, but also run train and generate high-resolution images such as 1024px and higher. +Moreover, this playbook demonstrates how DGX Spark can not only load several models in memory, but also train and generate images at high resolutions such as 1024px and above. 
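As background on the LoRA technique this playbook relies on, the idea can be sketched in a few lines of NumPy: a frozen weight matrix is augmented with a trainable low-rank product `B @ A`, so only a small fraction of parameters is updated. The dimensions, rank, and scaling below are illustrative assumptions, not FLUX.1's actual configuration (the real training is handled by the sd-scripts tooling in this playbook).

```python
import numpy as np

# LoRA replaces a full weight update dW with a low-rank product B @ A,
# so only r * (d_in + d_out) parameters are trained instead of d_in * d_out.
rng = np.random.default_rng(0)
d_out, d_in, r = 64, 128, 4          # hypothetical layer sizes, rank 4

W = rng.normal(size=(d_out, d_in))   # frozen base weight
A = rng.normal(size=(r, d_in))       # trainable, random init
B = np.zeros((d_out, r))             # trainable, zero init
alpha = 8.0                          # LoRA scaling factor

def lora_forward(x):
    # base path plus the scaled low-rank path
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
# With B initialized to zero, the LoRA layer starts out identical
# to the frozen base layer, which is why training is stable from step 0.
assert np.allclose(lora_forward(x), W @ x)
```

This zero-initialized `B` is also why a freshly exported LoRA has no visible effect until training has progressed.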
## What you'll accomplish @@ -41,13 +41,13 @@ The setup includes: **Duration**: - 15 minutes for initial setup model download time -- 1-2 hours for dreambooth lora training +- 1-2 hours for Dreambooth LoRA training **Risks**: - Docker permission issues may require user group changes and session restart - The recipe would require hyperparameter tuning and a high-quality dataset for the best results -**Rollback**: Stop and remove Docker containers, delete downloaded models if needed +**Rollback**: Stop and remove Docker containers, delete downloaded models if needed. ## Instructions @@ -78,110 +78,91 @@ In a terminal, clone the repository and navigate to the flux-finetuning director git clone https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main dgx-spark-playbooks ``` -## Step 3. Build the Docker container +## Step 3. Model Download -This docker image will download the required models and set up the environment for training and inference. -- `flux1-dev.safetensors` -- `ae.safetensors` -- `clip_l.safetensors` -- `t5xxl_fp16.safetensors` +You will need to be granted access to the FLUX.1-dev model since it is gated. Go to the [model card](https://huggingface.co/black-forest-labs/FLUX.1-dev) to accept the terms and gain access to the checkpoints. +If you do not have a `HF_TOKEN` already, follow the instructions [here](https://huggingface.co/docs/hub/en/security-tokens) to generate one. Authenticate your system by filling in your generated token in the following command. ```bash -docker build -f Dockerfile.train --build-arg HF_TOKEN=$HF_TOKEN -t flux-training . +export HF_TOKEN= +cd flux-finetuning/assets +sh download.sh ``` -## Step 4. Run the Docker container +If you already have fine-tuned LoRAs, place them inside `models/loras`. If you do not have any yet, proceed to the `Step 6. Training` section for more details. + +## Step 4. 
Base Model Inference + +Let's begin by generating an image using the base FLUX.1 model on the two concepts we are interested in: Toy Jensen and DGX Spark. ```bash -## Run with GPU support and mount current directory -docker run -it \ - --gpus all \ - --ipc=host \ - --ulimit memlock=-1 \ - --ulimit stack=67108864 \ - --net=host \ - flux-training +## Build the inference docker image +docker build -f Dockerfile.inference -t flux-comfyui . + +## Launch the ComfyUI container (ensure you are inside flux-finetuning/assets) +## You can ignore any import errors for `torchaudio` +sh launch_comfyui.sh +``` +Access ComfyUI at `http://localhost:8188` to generate images with the base model. Do not select any pre-existing template. + +Find the workflow section on the left-side panel of ComfyUI (or press `w`). Upon opening it, you should find two existing workflows. For the base Flux model, let's load the `base_flux.json` workflow. After loading the JSON, you should see ComfyUI render the workflow. + +Provide your prompt in the `CLIP Text Encode (Prompt)` block. For example, we will use `Toy Jensen holding a DGX Spark in a datacenter`. You can expect the generation to take ~3 minutes, since creating high-resolution 1024px images is compute-intensive. + +After playing around with the base model, you have two possible next steps. +* If you already have fine-tuned LoRAs placed inside `models/loras/`, please skip to the `Step 7. Finetuned Model Inference` section. +* If you wish to train a LoRA for your custom concepts, first make sure that the ComfyUI inference container is brought down before proceeding to train. You can bring it down by pressing `Ctrl+C` in the terminal. + +> **Note**: To clear out any extra occupied memory from your system, execute the following command outside the container after interrupting the ComfyUI server. ```bash +sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches' ``` -## Step 5. Train the model +## Step 5. 
Dataset Preparation -Inside the container, navigate to the sd-scripts directory and run the training script: +Let's prepare our dataset to perform Dreambooth LoRA finetuning on the FLUX.1-dev 12B model. However, if you wish to continue with the provided dataset of Toy Jensen and DGX Spark, feel free to skip to the [Training](#training) section. This dataset is a collection of public assets accessible via Google Images. + +You will need to prepare a dataset of all the concepts you would like to generate, with about 5-10 images for each concept. For this example, we would like to generate images with two concepts. + +**TJToy Concept** +- **Trigger phrase**: `tjtoy toy` +- **Training images**: 6 high-quality images of custom toy figures +- **Use case**: Generate images featuring the specific toy character in various scenes + +**SparkGPU Concept** +- **Trigger phrase**: `sparkgpu gpu` +- **Training images**: 7 images of custom GPU hardware +- **Use case**: Generate images featuring the specific GPU design in different contexts + +Create a folder for each concept with its corresponding name, and place it inside the `flux_data` directory. In our case, we have used `sparkgpu` and `tjtoy` as our concepts, and placed a few images inside each of them. + +Now, let's modify the `flux_data/data.toml` file to reflect the concepts chosen. Ensure that you update or create entries for each of your concepts by modifying the `image_dir` and `class_tokens` fields under `[[datasets.subsets]]`. For better performance in finetuning, it is good practice to append a class token to your concept name (like `toy` or `gpu`). + +## Step 6. Training + +Launch training by executing the following command. The training script is set up to use a default configuration that can generate reasonable images for your dataset in about 90 minutes of training. This training command will automatically store checkpoints in the `models/loras/` directory. 
```bash -cd /workspace/sd-scripts -sh train.sh +## Build the training docker image +docker build -f Dockerfile.train -t flux-train . + +## Trigger the training +sh launch_train.sh ``` -The training will: -- Use LoRA with dimension 256 -- Train for 100 epochs (saves every 25 epochs) -- Learn custom concepts: "tjtoy toy" and "sparkgpu gpu" -- Output trained LoRA weights to `saved_models/flux_dreambooth.safetensors` +## Step 7. Finetuned Model Inference -## Step 6. Generate images with command-line inference - -After training completes, generate sample images: +Now let's generate images using our finetuned LoRAs! ```bash -sh inference.sh +## Launch the ComfyUI container (ensure you are inside flux-finetuning/assets) +## You can ignore any import errors for `torchaudio` +sh launch_comfyui.sh ``` +Access ComfyUI at `http://localhost:8188` to generate images with the finetuned model. Do not select any pre-existing template. -This will generate several images demonstrating the learned concepts, stored in the `outputs` directory. +Find the workflow section on the left-side panel of ComfyUI (or press `w`). Upon opening it, you should find two existing workflows. For the finetuned Flux model, let's load the `finetuned_flux.json` workflow. After loading the JSON, you should see ComfyUI render the workflow. -## Step 7. Spin up ComfyUI for visual workflows +Provide your prompt in the `CLIP Text Encode (Prompt)` block. Now let's incorporate our custom concepts into our prompt for the finetuned model. For example, we will use `tjtoy toy holding sparkgpu gpu in a datacenter`. You can expect the generation to take ~3 minutes, since creating high-resolution 1024px images is compute-intensive. -Build the Docker image for ComfyUI: - -```bash -## Build the Docker image (this will download FLUX models automatically) -docker build -f Dockerfile.inference --build-arg HF_TOKEN=$HF_TOKEN -t flux-comfyui . 
-``` - -Run the ComfyUI container: - -```bash -docker run -it \ - --gpus all \ - --ipc=host \ - --ulimit memlock=-1 \ - --ulimit stack=67108864 \ - --net=host \ - flux-comfyui -``` - -Start ComfyUI for an intuitive interface: - -```bash -cd /workspace/ComfyUI -python main.py -``` - -Access ComfyUI at `http://localhost:8188` - -## Step 8. Deploy the trained LoRA in ComfyUI - -Feel free to deploy the trained LoRA in ComfyUI in existing or custom workflows. -Use your trained concepts in prompts: -- `"tjtoy toy"` - Your custom toy concept -- `"sparkgpu gpu"` - Your custom GPU concept -- `"tjtoy toy holding sparkgpu gpu"` - Combined concepts - -## Step 9. Cleanup - -Exit the container and optionally remove the Docker image: - -```bash -## Exit container -exit - -## Remove Docker image (optional) -docker stop -docker rmi flux-training -``` - -## Step 10. Next steps - -- Experiment with different LoRA strengths (0.8-1.2) in ComfyUI -- Train on your own custom concepts by replacing images in the `data/` directory -- Combine multiple LoRA models for complex compositions -- Integrate the trained LoRA into other FLUX workflows +For the provided prompt and random seed, the finetuned Flux model generated the following image. Unlike the base model, we can see that the finetuned model can generate multiple concepts in a single image. Additionally, ComfyUI exposes several fields to tune and change the look and feel of the generated images. diff --git a/nvidia/vlm-finetuning/README.md b/nvidia/vlm-finetuning/README.md index a27e7aa..09aa5bf 100644 --- a/nvidia/vlm-finetuning/README.md +++ b/nvidia/vlm-finetuning/README.md @@ -6,7 +6,6 @@ - [Overview](#overview) - [Instructions](#instructions) - - [Video VLM Testing](#video-vlm-testing) --- @@ -15,13 +14,13 @@ ## Basic Idea This playbook demonstrates how to fine-tune Vision-Language Models (VLMs) for both image and video understanding tasks on DGX Spark. 
-With 128GB of unified memory and powerful GPU acceleration, DGX Spark provides an ideal environment for training VRAM intensive multimodal models that can understand and reason about visual content. +With 128GB of unified memory and powerful GPU acceleration, DGX Spark provides an ideal environment for training VRAM-intensive multimodal models that can understand and reason about visual content. The playbook covers two distinct VLM fine-tuning approaches: - **Image VLM Fine-tuning**: Using Qwen2.5-VL-7B for wildfire detection from satellite imagery with GRPO (Generalized Reward Preference Optimization) - **Video VLM Fine-tuning**: Using InternVL3 8B for dangerous driving detection and structured metadata generation from driving videos -Both approaches leverage advanced training techniques including LoRA fine-tuning, preference optimization, and structured reasoning to achieve superior performance on specialized tasks. +Both approaches leverage advanced training techniques, including LoRA fine-tuning, preference optimization, and structured reasoning to achieve superior performance on specialized tasks. ## What you'll accomplish @@ -50,12 +49,12 @@ The setup includes: - 1-2 hours for video VLM training (depending on video dataset size) **Risks**: -- Docker permission issues may require user group changes and session restart +- Docker permission issues may require user group changes and a session restart - Large model downloads and datasets may require significant disk space and time - Training requires sustained GPU usage and memory - Dataset preparation may require manual steps (Kaggle downloads, video processing) -**Rollback**: Stop and remove Docker containers, delete downloaded models and datasets if needed +**Rollback**: Stop and remove Docker containers, delete downloaded models and datasets if needed. ## Instructions @@ -91,65 +90,79 @@ git clone https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx- Build the Docker image. 
This will set up the environment for both image and video VLM fine-tuning: ```bash -docker build -t vlm-finetuning . +## Enter the correct directory for building the image +cd vlm-finetuning/assets + +## Build the VLM fine-tuning container +docker build --build-arg HF_TOKEN=$HF_TOKEN -t vlm_demo . ``` ## Step 4. Run the Docker container ```bash -## Run with GPU support and mount current directory -docker run --gpus all -it --rm \ - -v $(pwd):/workspace \ - -p 8501:8501 \ - -p 8888:8888 \ - -p 6080:6080 \ - vlm-finetuning +## Run the container with GPU support +sh launch.sh + +## Enter the mounted directory within the container +cd /vlm_finetuning ``` +**Note**: The same Docker container and launch commands work for both image and video VLM recipes. The container includes all necessary dependencies, including FFmpeg, Decord, and optimized libraries for both workflows. ## Step 5. [Option A] For image VLM fine-tuning (Wildfire Detection) -#### 5.1. Set up Weights & Biases -Configure your wandb credentials for training monitoring: +#### 5.1. Model Download ```bash -export WANDB_ENTITY= -export WANDB_PROJECT="vlm_finetuning" -export WANDB_API_KEY= +hf download Qwen/Qwen2.5-VL-7B-Instruct ``` +If you already have a fine-tuned checkpoint, place it in the `saved_model/` folder. Note that your checkpoint number can be different. For a comparative analysis against the base model, skip directly to the `Finetuned Model Inference` section. + #### 5.2. Download the wildfire dataset from Kaggle and place it in the `data` directory The wildfire dataset can be found here: https://www.kaggle.com/datasets/abdelghaniaaba/wildfire-prediction-dataset -#### 5.3. Launch the Image VLM UI +#### 5.3. Base Model Inference + +Before we start finetuning, let's spin up the demo UI to evaluate the base model's performance on this task. 
```bash -cd ui_image -streamlit run Image_VLM_Finetuning.py +streamlit run Image_VLM.py ``` -Access the interface at `http://localhost:8501` +Access the streamlit demo at http://localhost:8501/. -#### 5.4. Configure and start training +When you access the streamlit demo for the first time, the backend triggers vLLM servers to spin up for the base model. You will see a spinner on the demo site as vLLM is being brought up for optimized inference. This step can take up to 15 minutes. -- Configure training parameters through the web interface -- Choose fine-tuning method (LoRA, QLoRA, or Full-Finetuning) -- Set hyperparameters (epochs, batch size, learning rate) -- Click "▶️ Start Finetuning" to begin GRPO training -- Monitor progress via embedded wandb charts +#### 5.4. GRPO Finetuning -#### 5.5. Test the fine-tuned model +We will perform GRPO finetuning to add reasoning capabilities to our base model and improve the model's understanding of the underlying domain. Considering that you have already spun up the streamlit demo, scroll to the `GRPO Training` section. -After training completes: -1. Bring down the UI with Ctrl+C -2. Edit `src/image_vlm_config.yaml` and update `finetuned_model_id` to point to your model in `saved_model/` -3. Restart the interface to test your fine-tuned model +After configuring all the parameters, hit `Start Finetuning` to begin the training process. You will need to wait about 15 minutes for the model to load and start recording metadata on the UI. As the training progresses, information such as the loss, step, and GRPO rewards will be recorded on a live table. + +The default loaded configuration should give you reasonable accuracy, taking 100 steps of training over a period of up to 2 hours. We achieved our best accuracy with around 1000 steps of training, taking close to 16 hours. + +After training is complete, the script automatically merges LoRA weights into the base model. 
After the training process has reached the desired number of training steps, it can take about 5 minutes to merge the LoRA weights. + +Once you stop training, the UI will automatically bring up the vLLM servers for the base model and the newly finetuned model. + +#### 5.5. Finetuned Model Inference + +Now we are ready to perform a comparative analysis between the base model and the finetuned model. + +Regardless of whether you just spun up the demo or just stopped training, please wait about 15 minutes for the vLLM servers to be brought up. + +Scroll down to the `Image Inference` section, and enter your prompt in the provided chat box. Upon clicking `Generate`, your prompt will first be sent to the base model and then to the finetuned model. You can use the following prompt to quickly test inference: + +`Identify if this region has been affected by a wildfire` + +If you trained your model sufficiently, you should see that the finetuned model is able to perform reasoning and provide a concise, accurate answer to the prompt. The reasoning steps are provided in markdown format, while the final answer is bolded and provided at the end of the model's response. ## Step 6. [Option B] For video VLM fine-tuning (Driver Behaviour Analysis) #### 6.1. Prepare your video dataset -Structure your dataset as follows: +Structure your dataset as follows. Ensure that `metadata.jsonl` contains rows of structured JSON data about each video. ``` dataset/ ├── videos/ @@ -159,45 +172,68 @@ dataset/ └── metadata.jsonl ``` -#### 6.2. Start Jupyter Lab +#### 6.2. Model Download + +> **Note**: These instructions assume you are already inside the Docker container. For container setup, refer to the main project README at `vlm-finetuning/assets/README.md`. ```bash -jupyter lab --ip=0.0.0.0 --port=8888 --allow-root +hf download OpenGVLab/InternVL3-8B ``` -Access Jupyter at `http://localhost:8888` +#### 6.3. Base Model Inference -#### 6.3. Run the training notebook +Before going ahead to finetune our video VLM for this task, let's see how the base InternVL3-8B performs. ```bash -cd ui_video/train -## Open and run internvl3_dangerous_driving.ipynb -## Update dataset path in the notebook to point to your data +## cd into vlm_finetuning/assets/ui_video if you haven't already +streamlit run Video_VLM.py ``` -#### 6.4. Run inference +Access the streamlit demo at http://localhost:8501/. -### Video VLM Testing -- Use the inference notebook to test on dashcam footage videos -- Generate structured JSON metadata for dangerous driving events -- Analyze traffic violations and safety risks +When you access the streamlit demo for the first time, the backend loads the base model via Hugging Face. You will see a spinner on the demo site as the model is being loaded, which can take up to 10 minutes. -## Step 7. Cleanup +First, let's select a video from our dashcam gallery. Upon clicking the green file open icon near a video, you should see the video render and play automatically for our reference. -Exit the container and optionally remove the Docker image: +If you are proceeding to train a finetuned model, ensure that the streamlit demo UI is brought down before proceeding to train. You can bring it down by pressing `Ctrl+C` in the terminal. + +> **Note**: To clear out any extra occupied memory from your system, execute the following command outside the container after interrupting the Streamlit server. ```bash -## Exit container -exit +sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches' +``` + +#### 6.4. Run the training notebook ```bash -## Remove Docker image (optional) -docker stop -docker rmi vlm-finetuning +## Enter the correct directory +cd /vlm_finetuning/ui_video/train + +## Start the Jupyter server +jupyter notebook video_vlm.ipynb +``` Access Jupyter at `http://localhost:8888`. Ensure that you set the path to your dataset correctly in the appropriate cell. +
+```python
+dataset_path = "/path/to/your/dataset"
+```
+After training, ensure that you shut down the Jupyter kernel in the notebook and stop the Jupyter server by pressing `Ctrl+C` in the terminal. + +> **Note**: To clear out any extra occupied memory from your system, execute the following command outside the container after interrupting the Jupyter server. +```bash +sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches' +``` +#### 6.5. Finetuned Model Inference + +Now we are ready to perform a comparative analysis between the base model and the finetuned model. + +If you haven't spun up the streamlit demo already, execute the following command. If you have just stopped training and are still within the live UI, skip to the next step. + +```bash +streamlit run Video_VLM.py ``` -## Step 8. Next steps +Access the streamlit demo at http://localhost:8501/. -- Train on your own custom datasets for specialized use cases -- Combine multiple VLM models for comprehensive multimodal analysis -- Explore other VLM architectures and training techniques -- Deploy fine-tuned models in production environments +If you trained your model sufficiently, you should see that the finetuned model is able to identify the salient events from the video and generate a structured output. + +Feel free to play around with additional videos available in the gallery.
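As a reference for the dataset layout in step 6.1, `metadata.jsonl` holds one JSON object per line describing each video. The sketch below shows the JSON Lines mechanics; the field names (`video`, `dangerous_driving`, `events`) are illustrative assumptions, not the playbook's actual schema, so use whatever keys your training notebook expects.

```python
import json

# Hypothetical metadata rows -- field names are illustrative assumptions.
rows = [
    {
        "video": "videos/video1.mp4",
        "dangerous_driving": True,
        "events": ["ran red light", "abrupt lane change"],
    },
    {
        "video": "videos/video2.mp4",
        "dangerous_driving": False,
        "events": [],
    },
]

# JSON Lines: one serialized object per line, newline-terminated.
with open("metadata.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read it back the same way the training pipeline would: line by line.
with open("metadata.jsonl") as f:
    parsed = [json.loads(line) for line in f]

assert parsed == rows
```

One object per line (rather than a single JSON array) lets the dataset be streamed and appended to without re-parsing the whole file.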