mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git
synced 2026-04-22 10:03:54 +00:00

chore: Regenerate all playbooks

This commit is contained in:
parent db351ceacc
commit 0f5c77e06e
@@ -98,7 +98,7 @@ After playing around with the base model, you have 2 possible next steps.

* If you already have fine-tuned LoRAs placed inside `models/loras/`, please skip to the [Load the finetuned workflow](#52-load-the-finetuned-workflow) section.
* If you wish to train a LoRA for your custom concepts, first make sure that the ComfyUI inference container is brought down before proceeding to train. You can bring it down by interrupting the terminal with a `Ctrl+C` keystroke.

> **Note**: To clear out any extra occupied memory from your system, execute the following command after interrupting the ComfyUI server.
> **Note**: To clear out any extra occupied memory from your system, execute the following command outside the container after interrupting the ComfyUI server.

```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
@@ -38,7 +38,7 @@ CMD="accelerate launch \

--gradient_accumulation_steps 4 \
--gradient_checkpointing \
--sdpa \
--max_train_epochs=25 \
--max_train_epochs=100 \
--save_every_n_epochs=25 \
--mixed_precision=bf16 \
--guidance_scale=1.0 \
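For context, the flags in the hunk above all belong to a single `accelerate launch` invocation assembled into `CMD`. A minimal sketch of how such a command string might be put together — the script name `train_network.py` and the dataset path are hypothetical placeholders, not from this repo:

```bash
## Sketch only: assemble the training command into a variable, following the
## playbook's own CMD="accelerate launch ..." pattern. The script name and
## dataset path below are hypothetical placeholders.
CMD="accelerate launch \
  train_network.py \
  --dataset_config /workspace/dataset.toml \
  --gradient_accumulation_steps 4 \
  --gradient_checkpointing \
  --sdpa \
  --max_train_epochs=100 \
  --save_every_n_epochs=25 \
  --mixed_precision=bf16 \
  --guidance_scale=1.0"

## Print the assembled command; run it with: eval "$CMD"
echo "$CMD"
```

Echoing the command before `eval` makes it easy to confirm flag changes (such as the epoch bump in this commit) before a long training run.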
@@ -13,74 +13,101 @@

## Basic Idea

This playbook guides you through setting up and using PyTorch for fine-tuning large language models on NVIDIA Spark devices.
This playbook guides you through setting up and using PyTorch for fine-tuning large language models and vision-language models on NVIDIA Spark devices. NeMo AutoModel provides GPU-accelerated, end-to-end training for Hugging Face models with native PyTorch support, enabling instant fine-tuning without conversion delays. The framework supports distributed training from a single GPU to multi-node clusters, with optimized kernels and memory-efficient recipes specifically designed for ARM64 architecture and Blackwell GPU systems.

## What you'll accomplish

You'll establish a complete fine-tuning environment for large language models (1-70B parameters) on your NVIDIA Spark device. By the end, you'll have a working installation that supports parameter-efficient fine-tuning (PEFT) and supervised fine-tuning (SFT).
You'll establish a complete fine-tuning environment for large language models (1-70B parameters) and vision-language models using NeMo AutoModel on your NVIDIA Spark device. By the end, you'll have a working installation that supports parameter-efficient fine-tuning (PEFT), supervised fine-tuning (SFT), and distributed training capabilities with FP8 precision optimizations, all while maintaining compatibility with the Hugging Face ecosystem.

## What to know before starting
## Prerequisites

These recipes are specifically for DGX Spark. Please make sure that the OS and drivers are up to date.

## Ancillary files

All files required for fine-tuning are included.

## Time & risk

**Time estimate:** 30-45 mins for setup and running fine-tuning. Fine-tuning run time varies depending on model size.
**Time estimate:**

**Risks:** Model downloads can be large (several GB), ARM64 package compatibility issues may require troubleshooting.
**Risks:** Model downloads can be large (several GB), ARM64 package compatibility issues may require troubleshooting, distributed training setup complexity increases with multi-node configurations.

**Rollback:**
## Instructions

## Step 1. Pull the latest PyTorch container
## Step 1. Verify system requirements

Check that your NVIDIA Spark device meets the prerequisites for NeMo AutoModel installation. This step runs on the host system to confirm CUDA toolkit availability and Python version compatibility.

```bash
docker pull nvcr.io/nvidia/pytorch:25.09-py3
## Verify CUDA installation
nvcc --version

## Verify GPU accessibility
nvidia-smi

## Check available system memory
free -h
```
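The three checks above can also be wrapped in one small loop that reports anything missing before you proceed — a convenience sketch, safe to run on any host:

```bash
## Report which of the prerequisite tools are available on this host.
for tool in nvcc nvidia-smi free; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found: $tool"
  else
    echo "missing: $tool"
  fi
done
```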
## Step 2. Launch Docker
## Step 2. Get the container image

```bash
docker run --gpus all -it --rm --ipc=host \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
-v ${PWD}:/workspace -w /workspace \
nvcr.io/nvidia/pytorch:25.09-py3

docker pull nvcr.io/nvidia/pytorch:25.08-py3
```
## Step 3. Install dependencies inside the container
## Step 3. Launch Docker

```bash
pip install transformers peft datasets "trl==0.19.1" "bitsandbytes==0.48"
docker run \
--gpus all \
--ulimit memlock=-1 \
-it --ulimit stack=67108864 \
--entrypoint /usr/bin/bash \
--rm nvcr.io/nvidia/pytorch:25.08-py3
```

## Step 4. Authenticate with Hugging Face
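Besides the interactive `huggingface-cli login` flow, the token can be supplied non-interactively through the `HF_TOKEN` environment variable, which the Hugging Face CLI tools honor. A sketch, with a placeholder token:

```bash
## Non-interactive alternative: export the token before running downloads.
## The value below is a placeholder; substitute your real token.
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxx"

## prints "Token variable set: yes"
echo "Token variable set: ${HF_TOKEN:+yes}"
```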
## Step 10. Troubleshooting

Common issues and solutions for NeMo AutoModel setup on NVIDIA Spark devices.

| Symptom | Cause | Fix |
|---------|--------|-----|
| `nvcc: command not found` | CUDA toolkit not in PATH | Add CUDA toolkit to PATH: `export PATH=/usr/local/cuda/bin:$PATH` |
| `pip install uv` permission denied | System-level pip restrictions | Use `pip3 install --user uv` and update PATH |
| GPU not detected in training | CUDA driver/runtime mismatch | Verify driver compatibility: `nvidia-smi` and reinstall CUDA if needed |
| Out of memory during training | Model too large for available GPU memory | Reduce batch size, enable gradient checkpointing, or use model parallelism |
| ARM64 package compatibility issues | Package not available for ARM architecture | Use source installation or build from source with ARM64 flags |

## Step 11. Cleanup and rollback

Remove the installation and restore the original environment if needed. These commands safely remove all installed components.

> **Warning:** This will delete all virtual environments and downloaded models. Ensure you have backed up any important training checkpoints.

```bash
huggingface-cli login
## <Enter your Hugging Face token>
## <Enter n for git credential>
## Remove virtual environment
rm -rf .venv
```
To run LoRA on Llama3 use the following command:
## Remove cloned repository
cd ..
rm -rf Automodel

```bash
python Llama3_8B_LoRA_finetuning.py
## Remove uv (if installed with --user)
pip3 uninstall uv

## Clear Python cache
rm -rf ~/.cache/pip
```

To run qLoRA finetuning on llama3-70B use the following command:

```bash
python Llama3_70B_qLoRA_finetuning.py
```

To run full finetuning on llama3-3B use the following command:

```bash
python Llama3_3B_full_finetuning.py
```

## Step 12. Next steps
@@ -6,10 +6,9 @@

- [Overview](#overview)
- [How to run inference with speculative decoding](#how-to-run-inference-with-speculative-decoding)
  - [Step 1. Run Eagle3 with GPT-OSS 120B](#step-1-run-eagle3-with-gpt-oss-120b)
  - [Step 2. Test the Eagle3 setup](#step-2-test-the-eagle3-setup)
  - [Step 1. Run Draft-Target Speculative Decoding](#step-1-run-draft-target-speculative-decoding)
  - [Step 2. Test the Draft-Target setup](#step-2-test-the-draft-target-setup)
  - [Step 1. Configure Docker permissions](#step-1-configure-docker-permissions)
  - [Step 2. Run Draft-Target Speculative Decoding](#step-2-run-draft-target-speculative-decoding)
  - [Step 3. Test the Draft-Target setup](#step-3-test-the-draft-target-setup)
- [Troubleshooting](#troubleshooting)
- [Cleanup](#cleanup)
- [Next Steps](#next-steps)
@@ -21,31 +20,28 @@

## Basic idea

Speculative decoding speeds up text generation by using a **small, fast model** to draft several tokens ahead, then having the **larger model** quickly verify or adjust them.
This way, the big model doesn't need to predict every token step-by-step, reducing latency while keeping output quality.
This way, the big model doesn't need to predict every token step-by-step, reducing latency while keeping output quality.

## What you'll accomplish

You'll explore two different speculative decoding approaches using TensorRT-LLM on NVIDIA Spark:
1. **Eagle3 with GPT-OSS 120B** - Advanced speculative decoding using Eagle3 draft models
2. **Traditional Draft-Target** - Classic speculative decoding with smaller model pairs (coming soon)
You'll explore speculative decoding with TensorRT-LLM on NVIDIA Spark using the traditional Draft-Target approach.

These examples demonstrate how to accelerate large language model inference while maintaining output quality.

## What to know before starting

- Experience with Docker and containerized applications
- Understanding of speculative decoding concepts (Eagle3 vs traditional draft-target)
- Understanding of speculative decoding concepts
- Familiarity with TensorRT-LLM serving and API endpoints
- Knowledge of GPU memory management for large language models
## Prerequisites

- NVIDIA Spark device with sufficient GPU memory available (80GB+ recommended for GPT-OSS 120B)
- NVIDIA Spark device with sufficient GPU memory available
- Docker with GPU support enabled

```bash
docker run --gpus all nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev nvidia-smi
```

- Access to NVIDIA's internal container registry (for Eagle3 example)
- HuggingFace authentication configured (if needed for model downloads)

```bash
huggingface-cli login
@@ -55,7 +51,7 @@ These examples demonstrate how to accelerate large language model inference whil

## Time & risk

**Duration:** 10-20 minutes for Eagle3 setup, additional time for model downloads (varies by network speed)
**Duration:** 10-20 minutes for setup, additional time for model downloads (varies by network speed)

**Risks:** GPU memory exhaustion with large models, container registry access issues, network timeouts during downloads
@@ -63,66 +59,30 @@ These examples demonstrate how to accelerate large language model inference whil

## How to run inference with speculative decoding

## Example 1: Eagle3 Speculative Decoding with GPT-OSS 120B

Eagle3 is an advanced speculative decoding technique that uses a specialized draft model to accelerate inference of large language models.

### Step 1. Run Eagle3 with GPT-OSS 120B

Execute the following command to download models and run Eagle3 speculative decoding:

```bash
docker run \
-v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \
--rm -it --ulimit memlock=-1 --ulimit stack=67108864 \
--gpus=all --ipc=host --network host nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
bash -c '
hf download openai/gpt-oss-120b && \
hf download nvidia/gpt-oss-120b-Eagle3 \
--local-dir /opt/gpt-oss-120b-Eagle3/ && \
cat > /tmp/extra-llm-api-config.yml <<EOF
enable_attention_dp: false
disable_overlap_scheduler: true
enable_autotuner: false
cuda_graph_config:
  max_batch_size: 1
speculative_config:
  decoding_type: Eagle
  max_draft_len: 4
  speculative_model_dir: /opt/gpt-oss-120b-Eagle3/

kv_cache_config:
  enable_block_reuse: false
EOF
trtllm-serve openai/gpt-oss-120b \
--backend pytorch --tp_size 1 \
--max_batch_size 1 \
--kv_cache_free_gpu_memory_fraction 0.95 \
--extra_llm_api_options /tmp/extra-llm-api-config.yml'
```
### Step 2. Test the Eagle3 setup

Once the server is running, you can test it with curl commands:

```bash
## Test completion endpoint
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "openai/gpt-oss-120b",
  "prompt": "The future of AI is",
  "max_tokens": 100,
  "temperature": 0.7
}'
```
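A variant of the same test keeps the request body in a file, so it can be validated and reused; endpoint and model name match the serve command above:

```bash
## Write the request payload to a file first.
cat > /tmp/spec_decode_request.json <<'EOF'
{
  "model": "openai/gpt-oss-120b",
  "prompt": "The future of AI is",
  "max_tokens": 100,
  "temperature": 0.7
}
EOF

## Confirm the payload is well-formed JSON before sending it.
python3 -c 'import json; json.load(open("/tmp/spec_decode_request.json")); print("payload ok")'

## Then send it with:
##   curl -X POST http://localhost:8000/v1/completions \
##        -H "Content-Type: application/json" \
##        -d @/tmp/spec_decode_request.json
```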
## Example 2: Traditional Draft-Target Speculative Decoding
## Traditional Draft-Target Speculative Decoding

This example demonstrates traditional speculative decoding using a smaller draft model to accelerate a larger target model.

### Step 1. Run Draft-Target Speculative Decoding
### Step 1. Configure Docker permissions

To easily manage containers without sudo, you must be in the `docker` group. If you choose to skip this step, you will need to run Docker commands with sudo.

Open a new terminal and test Docker access. In the terminal, run:

```bash
docker ps
```

If you see a permission denied error (something like `permission denied while trying to connect to the Docker daemon socket`), add your user to the docker group:

```bash
sudo usermod -aG docker $USER
```

> **Warning**: After running usermod, you must log out and log back in to start a new
> session with updated group permissions.
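After logging back in, a quick way to confirm the group change took effect (a convenience sketch, not from the playbook):

```bash
## Check whether the current session's groups include "docker".
if id -nG | tr ' ' '\n' | grep -qx docker; then
  echo "docker group: yes"
else
  echo "docker group: no (log out and back in, or re-run usermod)"
fi
```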
### Step 2. Run Draft-Target Speculative Decoding

Execute the following command to set up and run traditional speculative decoding:
@@ -158,9 +118,9 @@ EOF
"
```

### Step 2. Test the Draft-Target setup
### Step 3. Test the Draft-Target setup

Once the server is running, test it with API calls:
Once the server is running, test it by making an API call from another terminal:

```bash
## Test completion endpoint
@@ -206,9 +166,6 @@ docker stop <container_id>

### Next Steps

- Compare both Eagle3 and Draft-Target performance with baseline inference
- Experiment with different `max_draft_len` values (1, 2, 3, 4, 8) for both approaches
- Monitor token acceptance rates and throughput improvements across different model pairs
- Experiment with different `max_draft_len` values (1, 2, 3, 4, 8)
- Monitor token acceptance rates and throughput improvements
- Test with different prompt lengths and generation parameters
- Compare Eagle3 vs Draft-Target approaches for your specific use case
- Benchmark memory usage differences between the two methods
@@ -6,17 +6,18 @@

- [Overview](#overview)
- [Single Spark](#single-spark)
  - [Step 1. Verify environment prerequisites](#step-1-verify-environment-prerequisites)
  - [Step 2. Set environment variables](#step-2-set-environment-variables)
  - [Step 3. Validate TensorRT-LLM installation](#step-3-validate-tensorrt-llm-installation)
  - [Step 4. Create cache directory](#step-4-create-cache-directory)
  - [Step 5. Validate setup with quickstart_advanced](#step-5-validate-setup-with-quickstartadvanced)
  - [Step 1. Configure Docker permissions](#step-1-configure-docker-permissions)
  - [Step 2. Verify environment prerequisites](#step-2-verify-environment-prerequisites)
  - [Step 3. Set environment variables](#step-3-set-environment-variables)
  - [Step 4. Validate TensorRT-LLM installation](#step-4-validate-tensorrt-llm-installation)
  - [Step 5. Create cache directory](#step-5-create-cache-directory)
  - [Step 6. Validate setup with quickstart_advanced](#step-6-validate-setup-with-quickstartadvanced)
    - [LLM quickstart example](#llm-quickstart-example)
  - [Step 6. Validate setup with quickstart_multimodal](#step-6-validate-setup-with-quickstartmultimodal)
  - [Step 7. Validate setup with quickstart_multimodal](#step-7-validate-setup-with-quickstartmultimodal)
    - [VLM quickstart example](#vlm-quickstart-example)
  - [Step 7. Serve LLM with OpenAI-compatible API](#step-7-serve-llm-with-openai-compatible-api)
  - [Step 8. Troubleshooting](#step-8-troubleshooting)
  - [Step 9. Cleanup and rollback](#step-9-cleanup-and-rollback)
  - [Step 8. Serve LLM with OpenAI-compatible API](#step-8-serve-llm-with-openai-compatible-api)
  - [Step 9. Troubleshooting](#step-9-troubleshooting)
  - [Step 10. Cleanup and rollback](#step-10-cleanup-and-rollback)
- [Run on two Sparks](#run-on-two-sparks)
  - [Step 1. Review Spark clustering documentation](#step-1-review-spark-clustering-documentation)
  - [Step 2. Verify connectivity and SSH setup](#step-2-verify-connectivity-and-ssh-setup)
@@ -68,6 +69,8 @@ The following models are supported with TensorRT-LLM on Spark. All listed models

| Model | Quantization | Support Status | HF Handle |
|-------|-------------|----------------|-----------|
| **GPT-OSS-20B** | MXFP4 | ✅ | `openai/gpt-oss-20b` |
| **GPT-OSS-120B** | MXFP4 | ✅ | `openai/gpt-oss-120b` |
| **Llama-3.1-8B-Instruct** | FP8 | ✅ | `nvidia/Llama-3.1-8B-Instruct-FP8` |
| **Llama-3.1-8B-Instruct** | NVFP4 | ✅ | `nvidia/Llama-3.1-8B-Instruct-FP4` |
| **Llama-3.3-70B-Instruct** | NVFP4 | ✅ | `nvidia/Llama-3.3-70B-Instruct-FP4` |
@@ -96,7 +99,27 @@ The following models are supported with TensorRT-LLM on Spark. All listed models

## Single Spark

### Step 1. Verify environment prerequisites
### Step 1. Configure Docker permissions

To easily manage containers without sudo, you must be in the `docker` group. If you choose to skip this step, you will need to run Docker commands with sudo.

Open a new terminal and test Docker access. In the terminal, run:

```bash
docker ps
```

If you see a permission denied error (something like `permission denied while trying to connect to the Docker daemon socket`), add your user to the docker group:

```bash
sudo usermod -aG docker $USER
```

> **Warning**: After running usermod, you must log out and log back in to start a new
> session with updated group permissions.

### Step 2. Verify environment prerequisites

Confirm your Spark device has the required GPU access and network connectivity for downloading
models and containers.
@@ -110,7 +133,7 @@ docker run --rm --gpus all nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-

```

### Step 2. Set environment variables
### Step 3. Set environment variables

Set `HF_TOKEN` for model access.

@@ -118,7 +141,7 @@ Set `HF_TOKEN` for model access.
export HF_TOKEN=<your-huggingface-token>
```

### Step 3. Validate TensorRT-LLM installation
### Step 4. Validate TensorRT-LLM installation

After confirming GPU access, verify that TensorRT-LLM can be imported inside the container.

@@ -134,7 +157,7 @@ Expected output:
TensorRT-LLM version: 1.1.0rc3
```

### Step 4. Create cache directory
### Step 5. Create cache directory

Set up local caching to avoid re-downloading models on subsequent runs.

@@ -143,7 +166,7 @@ Set up local caching to avoid re-downloading models on subsequent runs.
mkdir -p $HOME/.cache/huggingface/
```

### Step 5. Validate setup with quickstart_advanced
### Step 6. Validate setup with quickstart_advanced

This quickstart validates your TensorRT-LLM setup end-to-end by testing model loading, inference engine initialization, and GPU execution with real text generation. It's the fastest way to confirm everything works before starting the inference API server.
@@ -181,6 +204,10 @@ docker run \
--gpus=all --ipc=host --network host \
nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
bash -c '
export TIKTOKEN_ENCODINGS_BASE="/tmp/harmony-reqs" && \
mkdir -p $TIKTOKEN_ENCODINGS_BASE && \
wget -P $TIKTOKEN_ENCODINGS_BASE https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken && \
wget -P $TIKTOKEN_ENCODINGS_BASE https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken && \
hf download $MODEL_HANDLE && \
python examples/llm-api/quickstart_advanced.py \
--model_dir $MODEL_HANDLE \

@@ -201,6 +228,10 @@ docker run \
--gpus=all --ipc=host --network host \
nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
bash -c '
export TIKTOKEN_ENCODINGS_BASE="/tmp/harmony-reqs" && \
mkdir -p $TIKTOKEN_ENCODINGS_BASE && \
wget -P $TIKTOKEN_ENCODINGS_BASE https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken && \
wget -P $TIKTOKEN_ENCODINGS_BASE https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken && \
hf download $MODEL_HANDLE && \
python examples/llm-api/quickstart_advanced.py \
--model_dir $MODEL_HANDLE \

@@ -208,7 +239,7 @@ docker run \
--max_tokens 64
'
```

### Step 6. Validate setup with quickstart_multimodal
### Step 7. Validate setup with quickstart_multimodal

### VLM quickstart example
@@ -266,10 +297,11 @@ docker run \
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```

### Step 7. Serve LLM with OpenAI-compatible API
### Step 8. Serve LLM with OpenAI-compatible API

Serve with OpenAI-compatible API via trtllm-serve:

#### Llama 3.1 8B Instruct

```bash
export MODEL_HANDLE="nvidia/Llama-3.1-8B-Instruct-FP4"

@@ -283,7 +315,39 @@ docker run --name trtllm_llm_server --rm -it --gpus all --ipc host --network hos
cat > /tmp/extra-llm-api-config.yml <<EOF
print_iter_log: false
kv_cache_config:
  dtype: "fp8"
  dtype: "auto"
  free_gpu_memory_fraction: 0.9
cuda_graph_config:
  enable_padding: true
disable_overlap_scheduler: true
EOF
trtllm-serve "$MODEL_HANDLE" \
--max_batch_size 64 \
--trust_remote_code \
--port 8355 \
--extra_llm_api_options /tmp/extra-llm-api-config.yml
'
```
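For reference, the heredoc above leaves this YAML in `/tmp/extra-llm-api-config.yml`; the inline comments are editorial glosses on what each field is commonly understood to control, not text from the playbook:

```yaml
print_iter_log: false             # suppress per-iteration log output
kv_cache_config:
  dtype: "auto"                   # let TensorRT-LLM choose the KV-cache dtype
  free_gpu_memory_fraction: 0.9   # share of free GPU memory reserved for KV cache
cuda_graph_config:
  enable_padding: true            # pad batches so captured CUDA graphs can be reused
disable_overlap_scheduler: true
```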
#### GPT-OSS 20B

```bash
export MODEL_HANDLE="openai/gpt-oss-20b"

docker run --name trtllm_llm_server --rm -it --gpus all --ipc host --network host \
-e HF_TOKEN=$HF_TOKEN \
-e MODEL_HANDLE="$MODEL_HANDLE" \
-v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \
nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
bash -c '
export TIKTOKEN_ENCODINGS_BASE="/tmp/harmony-reqs" && \
mkdir -p $TIKTOKEN_ENCODINGS_BASE && \
wget -P $TIKTOKEN_ENCODINGS_BASE https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken && \
wget -P $TIKTOKEN_ENCODINGS_BASE https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken && \
hf download $MODEL_HANDLE && \
cat > /tmp/extra-llm-api-config.yml <<EOF
print_iter_log: false
kv_cache_config:
  dtype: "auto"
  free_gpu_memory_fraction: 0.9
cuda_graph_config:
  enable_padding: true
@@ -309,7 +373,7 @@ curl -s http://localhost:8355/v1/chat/completions \
}'
```

### Step 8. Troubleshooting
### Step 9. Troubleshooting

Common issues and their solutions:

@@ -321,7 +385,7 @@ Common issues and their solutions:
| Container pull timeout | Network connectivity issues | Retry pull or use local mirror |
| Import tensorrt_llm fails | Container runtime issues | Restart Docker daemon and retry |

### Step 9. Cleanup and rollback
### Step 10. Cleanup and rollback

Remove downloaded models and containers to free up space when testing is complete.

@@ -454,7 +518,7 @@ export TRTLLM_MN_CONTAINER=$(docker ps -q -f name=trtllm-multinode)
docker exec $TRTLLM_MN_CONTAINER bash -c 'cat <<EOF > /tmp/extra-llm-api-config.yml
print_iter_log: false
kv_cache_config:
  dtype: "fp8"
  dtype: "auto"
  free_gpu_memory_fraction: 0.9
cuda_graph_config:
  enable_padding: true
@@ -29,9 +29,25 @@ hf download Qwen/Qwen2.5-VL-7B-Instruct

### 1.2 (Optional) Download the fine-tuned model

If you already have a fine-tuned checkpoint, place it in the `saved_model/` folder.
If you already have a fine-tuned checkpoint, place it in the `saved_model/` folder. Your directory structure should look something like this. Note that your checkpoint number can be different.

```
saved_model/
└── checkpoint-3/
    ├── config.json
    ├── generation_config.json
    ├── model.safetensors.index.json
    ├── model-00001-of-00004.safetensors
    ├── model-00002-of-00004.safetensors
    ├── model-00003-of-00004.safetensors
    ├── model-00004-of-00004.safetensors
    ├── preprocessor_config.json
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    ├── tokenizer.json
    ├── merges.txt
    └── vocab.json
```

If you already have a finetuned checkpoint that you would like to use just for a comparative analysis against the base model, skip directly to the [Finetuned Model Inference](#5-finetuned-model-inference) section.
@@ -119,7 +135,7 @@ You can additionally choose which layers you want to finetune in the VLM.

In this section, we can select the model parameters relevant to our training run.

- `Epochs`: 1-100
- `Steps`: 1-1000
- `Batch Size`: 1, 2, 4, 8, or 16
- `Learning Rate`: 1e-6 to 1e-2
- `Optimizer`: AdamW or Adafactor
@@ -134,11 +150,15 @@ For a GRPO setup, we also have the flexibility in choosing the reward that is as

### 4.4 Start training

After configuring all the parameters, hit `Start Finetuning` to begin the training process. You will need to wait about 15 mins for the model to load and start recording metadata on the UI. As the training progresses, information such as the loss, epoch and GRPO rewards will be recorded on a live table.
After configuring all the parameters, hit `Start Finetuning` to begin the training process. You will need to wait about 15 mins for the model to load and start recording metadata on the UI. As the training progresses, information such as the loss, step, and GRPO rewards will be recorded on a live table.

The default loaded configuration should give you reasonable accuracy, taking 100 steps of training over a period of up to 2 hours. We achieved our best accuracies with around 1000 steps of training, taking close to 16 hours.

After training is complete, the script automatically merges the LoRA weights into the base model. Once the training process has reached the desired number of training steps, it can take 5 mins to merge the LoRA weights.

### 4.5 Stop training

If you wish to stop training, just hit the `Stop Finetuning` button. Ensure that you stop the training with at least 50 steps complete to ensure that a finetuned checkpoint is stored.
If you wish to stop training, just hit the `Stop Finetuning` button. Please use this button only to interrupt training. This button does not guarantee that the checkpoints will be properly stored or merged with the LoRA adapter layers.

Once you stop training, the UI will automatically bring up the vLLM servers for the base model and the newly finetuned model.
@@ -168,7 +188,7 @@ Scroll down to the `Image Inference` section, and enter your prompt in the provi

If you trained your model sufficiently, you should see that the finetuned model is able to perform reasoning and provide a concise, accurate answer to the prompt. The reasoning steps are provided in markdown format, while the final answer is bolded and provided at the end of the model's response.

For the image shown below, we have trained the model for 1000 steps, which took about 4 hours.
For the image shown below, we have trained the model for 1000 steps, which took about 16 hours.

### 5.4 Further analysis
@@ -192,3 +212,18 @@ ui_image/
│   └── inference_screenshot.png    # UI demonstration screenshot
└── saved_model/                    # Training checkpoints directory (update config to point here)
```

## Troubleshooting

If you are facing VRAM issues where the model fails to load or offloads to the CPU/meta device, ensure you bring down all Docker containers and flush out dangling memory.

```bash
docker ps

docker stop <CONTAINER_ID_1> && docker rm <CONTAINER_ID_1>
docker stop <CONTAINER_ID_2> && docker rm <CONTAINER_ID_2>

docker system prune

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
@@ -177,7 +177,8 @@ def start_train(config):

        max_prompt_length=config["model"]["max_seq_length"],
        max_completion_length=config["model"]["max_seq_length"],
        max_steps=config["hyperparameters"]["steps"],
        save_steps=3,
        save_steps=5,
        save_total_limit=2,
        max_grad_norm=0.1,
        report_to="none",
        output_dir=config["hyperparameters"]["output_dir"],
@ -1,84 +0,0 @@
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml # type: ignore[import]
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))== Unsloth 2025.9.11: Fast Qwen2_5_Vl patching. Transformers: 4.56.2.
   \\   /|   NVIDIA GB10. Num GPUs = 1. Max memory: 119.699 GB. Platform: Linux.
O^O/ \_/ \   Torch: 2.9.0a0+50eac811a6.nv25.09. CUDA: 12.1. CUDA Toolkit: 13.0. Triton: 3.4.0
\        /   Bfloat16 = TRUE. FA [Xformers = 0.0.33+5146f2a.d20251005. FA2 = True]
 "-____-"    Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards: 25%|██▌ | 1/4 [00:29<01:28, 29.57s/it]
Loading checkpoint shards: 50%|█████ | 2/4 [00:58<00:58, 29.48s/it]
Loading checkpoint shards: 75%|███████▌ | 3/4 [01:27<00:28, 28.99s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [01:36<00:00, 21.25s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [01:36<00:00, 24.20s/it]
Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None}.
==((====))== Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|   Num examples = 30,250 | Num Epochs = 1 | Total steps = 30,250
O^O/ \_/ \   Batch size per device = 2 | Gradient accumulation steps = 1
\        /   Data Parallel GPUs = 1 | Total batch size (2 x 1 x 1) = 2
 "-____-"    Trainable parameters = 103,043,072 of 8,395,209,728 (1.23% trained)
0%| | 0/30250 [00:00<?, ?it/s]`generation_config` default values have been modified to match model-specific defaults: {'max_length': 32768, 'temperature': 1e-06, 'repetition_penalty': 1.05}. If this is not desired, please set these values explicitly.
----------------------------------
Question:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>Identify if this region has been affected by a wildfire. Also first provide your reasoning or working out on how you would go about identifying the presence of wildfire affected regions between <REASONING> and </REASONING> and then your final answer between <SOLUTION> and (put a simple Yes or No here) </SOLUTION><|im_end|>
<|im_start|>assistant

Answer:
No
Response:
<REASONING>
To identify if the region has been affected by a wildfire, we need to look for signs such as charred areas, absence of vegetation, or changes in the landscape that might indicate fire activity. In satellite imagery, these signs can be subtle but may include:

1. **Burnt Vegetation**: Areas where the vegetation appears to have been consumed by fire, often appearing as darkened patches.
2. **Ash or Ash Patches**: These can appear as lighter-colored areas compared to surrounding areas.
3. **Changes in Land Cover**: The presence of new structures or changes in the type of land cover that could indicate recent construction or clearing.
4. **Smoke Plumes**: While not visible in this still image, smoke plumes from wildfires can sometimes be seen in satellite imagery.

In this image:
- There are no obvious darkened or charred areas that would suggest recent fire activity.
- The vegetation appears intact, with no signs of ash or burnt areas.
- There are no visible smoke plumes or signs of recent construction that would indicate recent fire activity.

</REASONING>

<SOLUTION>
No
</SOLUTION>
0%| | 1/30250 [01:15<631:36:01, 75.17s/it]
{'loss': 0.0, 'grad_norm': 3.589693005778827e-05, 'learning_rate': 0.0, 'num_tokens': 879.0, 'completions/mean_length': 205.5, 'completions/min_length': 177.0, 'completions/max_length': 234.0, 'completions/clipped_ratio': 0.0, 'completions/mean_terminated_length': 205.5, 'completions/min_terminated_length': 177.0, 'completions/max_terminated_length': 234.0, 'rewards/format_reward_func/mean': 2.0, 'rewards/format_reward_func/std': 0.0, 'rewards/correctness_reward_func/mean': 5.0, 'rewards/correctness_reward_func/std': 0.0, 'reward': 7.0, 'reward_std': 0.0, 'frac_reward_zero_std': 1.0, 'completion_length': 205.5, 'kl': 0.0005636619171127677, 'epoch': 0.0}
0%| | 1/30250 [01:15<631:36:01, 75.17s/it]----------------------------------
Question:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>Identify if this region has been affected by a wildfire. Also first provide your reasoning or working out on how you would go about identifying the presence of wildfire affected regions between <REASONING> and </REASONING> and then your final answer between <SOLUTION> and (put a simple Yes or No here) </SOLUTION><|im_end|>
<|im_start|>assistant

Answer:
Yes
Response:
<REASONING>
To identify if a region has been affected by a wildfire, one would typically look for signs such as charred areas, burn scars, or changes in vegetation color and texture. In satellite imagery, these changes can be subtle but often noticeable.

1. **Burn Scars**: These are areas where the ground has been scorched, often appearing as lighter or different shades compared to surrounding areas.
2. **Vegetation Changes**: Wildfires can alter the appearance of vegetation, making it appear more brown or less dense than before.
3. **Ash and Debris**: Ash and debris from a fire can leave a distinct mark on the landscape, which might not be immediately obvious but can be seen with close inspection.

In the provided image:
- The area appears to have a mix of green and brown tones, which could indicate recent vegetation change.
- There is a distinct darker area that could be interpreted as a burn scar, but without more detailed information or higher resolution imagery, it's difficult to confirm.
- The overall pattern suggests a mix of forested and open areas, which is typical for many regions.

However, without more specific indicators like smoke plumes, ash deposits, or detailed post-fire analysis, it's challenging to definitively conclude the presence of a wildfire.

</REASONING>

<SOLUTION>
No
</SOLUTION>
Unsloth: Will smartly offload gradients to save VRAM!
0%| | 2/30250 [02:21<587:04:23, 69.87s/it]
{'loss': 0.0, 'grad_norm': 3.785433727898635e-05, 'learning_rate': 3.3057851239669425e-09, 'num_tokens': 1864.0, 'completions/mean_length': 258.5, 'completions/min_length': 242.0, 'completions/max_length': 275.0, 'completions/clipped_ratio': 0.0, 'completions/mean_terminated_length': 258.5, 'completions/min_terminated_length': 242.0, 'completions/max_terminated_length': 275.0, 'rewards/format_reward_func/mean': 2.0, 'rewards/format_reward_func/std': 0.0, 'rewards/correctness_reward_func/mean': 0.0, 'rewards/correctness_reward_func/std': 0.0, 'reward': 2.0, 'reward_std': 0.0, 'frac_reward_zero_std': 1.0, 'completion_length': 258.5, 'kl': 0.0006416599499061704, 'epoch': 0.0}
0%| | 2/30250 [02:21<587:04:23, 69.87s/it]
@ -1,34 +1,35 @@
# Video VLM Fine-tuning with InternVL3

This project demonstrates fine-tuning the InternVL3 model for video analysis, specifically for dangerous driving detection and structured metadata generation from driving videos.
This project builds on top of the image VLM fine-tuning recipe to extend it to the video modality. The notebook demonstrates how to fine-tune the InternVL3 model for domain-specific video analysis. For this prototype example, we use driving dashcam footage from the [Nexar Collision Prediction dataset](nexar-ai/nexar_collision_prediction) to generate structured data, which is then used for fine-tuning.

## Workflow Overview


<figure>
<img src="assets/training_video.png" alt="Workflow Overview" width="1000"/>
<figcaption>Video VLM fine-tuning Workflow Overview</figcaption>
</figure>

### Training Workflow Steps:
A typical workflow for video fine-tuning includes the following:
1. **Data Collection**: Collect raw footage/videos for a domain-specific task. If the videos are very long, chunk them into reasonably sized files, for instance 5-second clips.
2. **Generate Structured Captions**: Collect a structured caption for each video, using either human-generated labels or a larger VLM.
3. **Train InternVL3 Model**: Perform Supervised Finetuning on InternVL3-8B to extract structured metadata.
4. **Inference**: The fine-tuned model is now ready for analysing domain-specific videos.

1. **🎥 Dashcam Footage**: Dashcam footage from the Nexar Collision Prediction dataset
2. **Generate Structured caption**: Leverage a very large VLM (InternVL3-78B) to generate structured captions from raw videos
3. **🧠 Train InternVL3 Model**: Perform Supervised Finetuning on InternVL3-8B to extract structured metadata
4. **🚀 Fine-tuned VLM**: Trained model ready for analysing driver behaviour and risk factors
## Contents
1. [Dataset Preparation](#1-dataset-preparation)
2. [Model Download](#2-model-download)
3. [Base Model Inference](#3-base-model-inference)
4. [SFT Finetuning](#4-sft-finetuning)
5. [Finetuned Model Inference](#5-finetuned-model-inference)

## 1. Dataset Preparation

## Training
### 1.1 Data Source
Identify a video data source that would benefit from structured data analysis. The videos can be either live footage or shorter video clips. In our case, we have chosen the [Nexar Collision Prediction dataset](nexar-ai/nexar_collision_prediction).

### Data Requirements
### 1.2 Caption Schema
Based on the structured metadata that you would like to analyze from your video dataset, come up with a caption schema that can concisely capture your requirements. In our case, we have used the following schema.

Your dataset should be structured as follows:
```
dataset/
├── videos/
│   ├── video1.mp4
│   ├── video2.mp4
│   └── ...
└── metadata.jsonl  # Contains video paths and labels
```

Each line in `metadata.jsonl` should contain:
```json
{
    "video": "videos/video1.mp4",
@ -41,72 +42,186 @@ Each line in `metadata.jsonl` should contain:
}
```

### Running Training
### 1.3 Caption Generation
With the caption schema decided, we must now generate ground-truth structured captions for all our videos. This can be achieved either by leveraging a larger VLM for AI-assisted annotation or by having human labellers caption the videos manually.

1. **Update Dataset Path**: Edit the training notebook to point to your dataset:
```python
dataset_path = "/path/to/your/dataset"
```
### 1.4 Dataset structure

2. **Run Training Notebook**:
```bash
# Inside the container, navigate to the training directory
cd ui_video/train
jupyter notebook video_vlm.ipynb
```
```
# Enter the correct directory
cd ui_video
```

3. **Monitor Training**: Training progress and metrics are displayed directly in the notebook interface.
Place all your videos in `dataset/videos`. Additionally, the captions should be placed inside `metadata.jsonl`.

### Training Configuration
Your dataset should be structured as follows:
```
dataset/
├── videos/
│   ├── video1.mp4
│   ├── video2.mp4
│   └── ...
└── metadata.jsonl
```

Your `metadata.jsonl` should look like this:

```
{"video": ..., "caption": ..., "event_type": ...}
{"video": ..., "caption": ..., "event_type": ...}
{"video": ..., "caption": ..., "event_type": ...}
```
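The `metadata.jsonl` layout above is one JSON object per line. A minimal sketch of parsing it into training samples (field names `video` and `caption` come from the example above; the helper name is ours):

```python
import json

def load_metadata(jsonl_path):
    """Parse a metadata.jsonl file into a list of sample dicts."""
    samples = []
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # every training sample needs at least a video path and a caption
            assert "video" in record and "caption" in record
            samples.append(record)
    return samples
```

Keeping the file strictly one-object-per-line makes it easy to append newly captioned videos without rewriting the whole dataset.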

## 2. Model Download

> **Note**: These instructions assume you are already inside the Docker container. For container setup, refer to the main project README at `vlm-finetuning/assets/README.md`.

### 2.1 Download the pre-trained model

```bash
hf download OpenGVLab/InternVL3-8B
```

### 2.2 (Optional) Download the fine-tuned model

If you already have a fine-tuned checkpoint, place it in the `saved_model/` folder. Your directory structure should look something like this. Note that your checkpoint number may differ.

```
saved_model/
└── checkpoint-3/
    ├── config.json
    ├── generation_config.json
    ├── model.safetensors.index.json
    ├── model-00001-of-00004.safetensors
    ├── model-00002-of-00004.safetensors
    ├── model-00003-of-00004.safetensors
    ├── model-00004-of-00004.safetensors
    ├── preprocessor_config.json
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    ├── tokenizer.json
    ├── merges.txt
    └── vocab.json
```

If you already have a finetuned checkpoint that you would like to use only for a comparative analysis against the base model, skip directly to the [Finetuned Model Inference](#5-finetuned-model-inference) section.

## 3. Base Model Inference

Before finetuning our video VLM for this task, let's see how the base InternVL3-8B performs.

### 3.1 Spin up the Streamlit demo

```bash
# cd into vlm_finetuning/assets/ui_video if you haven't already
streamlit run Video_VLM.py
```

Access the Streamlit demo at http://localhost:8501/.

### 3.2 Wait for demo spin-up

When you access the Streamlit demo for the first time, the backend triggers Hugging Face to load the base model. You will see a spinner on the demo site while the model is being loaded, which can take up to 10 minutes.

### 3.3 Run base model inference

First, let's select a video from our dashcam gallery. Upon clicking the green file-open icon next to a video, you should see the video render and play automatically.

Scroll down, enter your prompt in the chat box and hit `Generate`. Your prompt is first sent to the base model, and you should see the generated response in the left chat box. If you did not provide a finetuned model, no generations will appear in the right chat box.

<figure>
<img src="assets/inference_screenshot.png" alt="Inference Screenshot" width="1000"/>
<figcaption>Base model inference on the UI</figcaption>
</figure>

As you can see, the base model is incapable of identifying the right events for this domain-specific task. Even when the base model does identify these events, it still only converts one form of unstructured data into another, so we cannot conduct reasonable data analytics for insights on large-scale video footage. Let's try to improve the model's accuracy and structured captioning ability by performing SFT training.

If you are proceeding to train a finetuned model, ensure that the Streamlit demo UI is brought down before training. You can bring it down by interrupting the terminal with a `Ctrl+C` keystroke.

> **Note**: To clear out any extra occupied memory from your system, execute the following command outside the container after interrupting the Streamlit server.
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```

## 4. SFT Finetuning

We will perform SFT finetuning to improve the quality of the base model and generate schema-adhering structured output.

### 4.1 Load the Jupyter notebook

```bash
# Inside the container, navigate to the training directory
cd train
jupyter notebook video_vlm.ipynb
```

### 4.2 Train the model

Follow the instructions in the Jupyter notebook to perform SFT finetuning on a video VLM. Ensure that you set the path to your dataset correctly in the appropriate cell.

```python
dataset_path = "/path/to/your/dataset"
```

### 4.3 Training Configuration

Here are some of the key configurable training parameters. Please note that for reasonable quality, you will need to train your video VLM for at least 24 hours, given the complexity of processing spatio-temporal video sequences.

Key configurable training parameters:
- **Model**: InternVL3-8B
- **Video Frames**: 12 to 16 frames per video
- **Sampling Mode**: Uniform temporal sampling
- **LoRA Configuration**: Efficient parameter updates for large-scale fine-tuning
- **Hyperparameters**: Exhaustive suite of hyperparameters to tune for video VLM finetuning
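Uniform temporal sampling, listed above, picks frames evenly spaced across a clip rather than consecutively. A minimal sketch of the index selection (the helper name is ours, not from the notebook):

```python
def uniform_frame_indices(total_frames, num_frames):
    """Pick `num_frames` indices evenly spaced across a clip of `total_frames`."""
    if total_frames <= num_frames:
        return list(range(total_frames))
    # take the center frame of each of num_frames equal-length segments
    seg = total_frames / num_frames
    return [int(seg * i + seg / 2) for i in range(num_frames)]
```

For a 120-frame clip with 12 sampled frames, this yields one frame from the middle of each 10-frame segment, so the whole event (before, during, after) is covered.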

## Inference
### 4.4 Monitor Training

### Running Inference
You can monitor and evaluate the training progress and metrics, as they are continuously displayed in the notebook.

1. **Streamlit Web Interface**:
```bash
# Start the interactive web interface
cd ui_video
streamlit run Video_VLM.py
```

The interface provides:
- Dashcam video gallery and playback
- Side-by-side comparison between base and finetuned model
- JSON output generation
- Tabular view of structured data extracted for analysis
### 4.5 Shutdown

2. **Configuration**: Edit `src/video_vlm_config.yaml` to modify model settings, frame count, and sampling strategy.
After training, ensure that you shut down the Jupyter kernel in the notebook and kill the Jupyter server in the terminal with a `Ctrl+C` keystroke.

### Sample Output

The model generates structured JSON output like:
```json
{
    "caption": "A vehicle makes a dangerous lane change without signaling while speeding on a highway during daytime with clear weather conditions.",
    "event_type": "near_miss",
    "cause_of_risk": ["speeding", "risky_maneuver"],
    "presence_of_rule_violations": ["failure_to_use_turn_signals"],
    "intended_driving_action": ["change_lanes"],
    "traffic_density": "medium",
    "driving_setting": ["highway"],
    "time_of_day": "day",
    "light_conditions": "normal",
    "weather": "clear",
    "scene": "highway"
}
```
> **Note**: To clear out any extra occupied memory from your system, execute the following command outside the container after interrupting the Jupyter server.
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```

Inference Screenshot
## 5. Finetuned Model Inference


Now we are ready to perform a comparative analysis between the base model and the finetuned model.

### 5.1 (Optional) Spin up the Streamlit demo

If you haven't spun up the Streamlit demo already, execute the following command. If you have just stopped training and are still within the live UI, skip to the next step.

```bash
streamlit run Video_VLM.py
```

Access the Streamlit demo at http://localhost:8501/.

### 5.2 Wait for demo spin-up

When you access the Streamlit demo for the first time, the backend triggers Hugging Face to load the base model. You will see a spinner on the demo site while the model is being loaded, which can take up to 10 minutes.

### 5.3 Run finetuned model inference

Scroll down to the `Video Inference` section, and enter your prompt in the provided chat box. Upon clicking `Generate`, your prompt is first sent to the base model and then to the finetuned model. You can use the following prompt to quickly test inference:

`Analyze the dashcam footage for unsafe driver behavior`

If you trained your model sufficiently, you should see that the finetuned model is able to identify the salient events from the video and generate structured output.

### 5.4 Further analysis

Since the model's output adheres to the schema we trained on, we can directly export the model's predictions into a database for video analytics. For the screenshot shown below, we trained the model for over 24 hours.

<figure>
<img src="assets/finetuned_screenshot.png" alt="Finetuned Screenshot" width="1000"/>
<figcaption>Finetuned model inference on the UI</figcaption>
</figure>

Feel free to play around with additional videos available in the gallery.
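Because every prediction follows the same schema, simple analytics over a batch of videos become one-liners before anything even reaches a database. A minimal sketch of aggregating event types across model outputs (the sample records here are invented for illustration):

```python
import json
from collections import Counter

# hypothetical batch of schema-adhering model outputs, one JSON string per video
predictions = [
    '{"event_type": "near_miss", "weather": "clear"}',
    '{"event_type": "collision", "weather": "rain"}',
    '{"event_type": "near_miss", "weather": "clear"}',
]

records = [json.loads(p) for p in predictions]
# count how often each event type occurs across the fleet
event_counts = Counter(r["event_type"] for r in records)
```

The same pattern extends to any schema field (`weather`, `traffic_density`, rule violations), which is exactly the tabular view the Streamlit UI builds.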

## File Structure

@ -127,11 +242,21 @@ ui_video/
# Root directory also contains:
├── Dockerfile          # Multi-stage Docker build with FFmpeg/Decord
└── launch.sh           # Docker launch script
```
# Training checkpoints directory (update config to point here)
```

## Model Capabilities
## Troubleshooting

The fine-tuned InternVL3 model can:
- **Video Analysis**: Process multi-frame dashcam footage for comprehensive scene understanding
- **Safety Detection**: Identify dangerous driving patterns, near-misses, and traffic violations
- **Structured Output**: Generate JSON metadata with standardized driving scene categories
If you are facing VRAM issues where the model fails to load or offloads to the cpu/meta device, ensure you bring down all Docker containers and flush out dangling memory.

```bash
docker ps

# stop running containers before removing them
docker stop <CONTAINER_ID_1> && docker rm <CONTAINER_ID_1>
docker stop <CONTAINER_ID_2> && docker rm <CONTAINER_ID_2>

docker system prune

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
@ -28,6 +28,7 @@ import streamlit as st
import torchvision.transforms as T
from decord import VideoReader, cpu
from transformers import AutoTokenizer, AutoModel
from transformers.trainer_utils import get_last_checkpoint
from torchvision.transforms.functional import InterpolationMode


@ -68,10 +69,10 @@ def random_id():
    return "".join(random.choices(chars, k=8)).lower()


def initialize_session_state(resources):
def initialize_session_state(base_model, finetuned_model):
    # Initialize page-specific session state
    st.session_state["base_video_vlm"] = st.session_state.get("base_video_vlm", resources["base"])
    st.session_state["finetuned_video_vlm"] = st.session_state.get("finetuned_video_vlm", resources["finetuned"])
    st.session_state["base_video_vlm"] = st.session_state.get("base_video_vlm", base_model)
    st.session_state["finetuned_video_vlm"] = st.session_state.get("finetuned_video_vlm", finetuned_model)
    st.session_state["current_sample"] = st.session_state.get("current_sample", None)
    st.session_state["df"] = st.session_state.get("df",
        pd.DataFrame(columns=[
@ -97,14 +98,9 @@ def load_config():


@st.cache_resource
def initialize_resources(inference_config):
    base_model = load_model_for_inference(inference_config, "base")
    finetuned_model = load_model_for_inference(inference_config, "finetuned")

    return {
        "base": {"model": base_model},
        "finetuned": {"model": finetuned_model},
    }
def initialize_model(model_path):
    model = InternVLModel(model_path)
    return {"model": model}


def main():
@ -121,10 +117,15 @@ def main():
    config = load_config()
    if st.session_state.get("base_video_vlm", None) is None:
        st.toast("Loading model", icon="⏳", duration="short")
    resource = initialize_resources(config["inference"])
    base_model = initialize_model(config["inference"]["model_id"])
    finetuned_model_path = get_last_checkpoint(config["inference"]["finetuned_model_id"])
    if finetuned_model_path is not None:
        finetuned_model = initialize_model(finetuned_model_path)
    else:
        finetuned_model = {"model": None}
    if st.session_state.get("base_video_vlm", None) is None:
        st.toast("Model loaded", icon="✅", duration="short")
    initialize_session_state(resource)
    initialize_session_state(base_model, finetuned_model)

    # gallery section
    st.markdown("---")
@ -194,13 +195,16 @@ def main():
    response = start_inference("base")
    base_generation.markdown(response)

    with st.spinner("Running..."):
        response = start_inference("finetuned")
    finetuned_generation.markdown(response)
    if st.session_state["finetuned_video_vlm"].get("model", None) is not None:
        with st.spinner("Running..."):
            response = start_inference("finetuned")
        finetuned_generation.markdown(response)

    response = json.loads(response[7: -3].strip())
    response["caption"] = random_id()  # replace caption with driver id
    st.session_state["df"].loc[len(st.session_state["df"])] = list(response.values())
        response = json.loads(response[7: -3].strip())
        response["caption"] = random_id()  # replace caption with driver id
        st.session_state["df"].loc[len(st.session_state["df"])] = list(response.values())
    else:
        finetuned_generation.markdown("```No response since there is no finetuned model```")

    # data analysis section
    st.markdown("---")
@ -349,7 +353,7 @@ class InternVLModel:
        )

        return response

    def _infer_realtime(self, video_path, prompt, num_frames, chunk_duration):
        video = VideoReader(video_path, ctx=cpu(0), num_threads=1)
        fps = video.get_avg_fps()
@ -378,17 +382,6 @@ class InternVLModel:
        yield response


def load_model_for_inference(config, model_type):
    if model_type == "finetuned":
        model_name = config["finetuned_model_id"]
    elif model_type == "base":
        model_name = config["model_id"]
    else:
        raise ValueError(f"Invalid model type: {model_type}")

    return InternVLModel(model_name)


@torch.no_grad()
def start_inference(model_type):
    # define prompt
@ -396,6 +389,7 @@ def start_inference(model_type):
    if model_type == "finetuned":
        prompt = SCAP_PROMPT.format(prompt=prompt)

    print(model_type)
    response = st.session_state[f"{model_type}_video_vlm"]["model"].infer(
        st.session_state["current_sample"],
        prompt,

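The new loading path above falls back to no finetuned model when `get_last_checkpoint` finds nothing. A minimal sketch of equivalent checkpoint discovery, assuming checkpoints are saved as `checkpoint-<step>` subdirectories as in the `saved_model/` layout (the helper name is ours, not the transformers implementation):

```python
import os
import re

def last_checkpoint(folder):
    """Return the path of the highest-numbered checkpoint-<step> subdir, or None."""
    if not os.path.isdir(folder):
        return None
    pattern = re.compile(r"^checkpoint-(\d+)$")
    steps = []
    for name in os.listdir(folder):
        m = pattern.match(name)
        if m and os.path.isdir(os.path.join(folder, name)):
            steps.append(int(m.group(1)))
    if not steps:
        return None
    # the largest global step is the most recent checkpoint
    return os.path.join(folder, f"checkpoint-{max(steps)}")
```

Comparing by the numeric step (not string order) matters: lexicographically, `checkpoint-9` would sort after `checkpoint-10`.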
Binary file not shown.
After Width: | Height: | Size: 1.5 MiB
Binary file not shown.
Before Width: | Height: | Size: 1.3 MiB  After Width: | Height: | Size: 2.0 MiB
@ -17,6 +17,6 @@

inference:
  model_id: OpenGVLab/InternVL3-8B
  finetuned_model_id: RLakshmi24/internvl_nexar_sft
  finetuned_model_id: saved_model
  num_frames: 12
  sampling_mode: default
File diff suppressed because one or more lines are too long