# Multi-modal Inference
> Setup multi-modal inference with TensorRT
## Table of Contents
- [Overview](#overview)
- [Instructions](#instructions)
  - [Step 1. Launch the TensorRT container environment](#step-1-launch-the-tensorrt-container-environment)
  - [Step 2. Clone and set up TensorRT repository](#step-2-clone-and-set-up-tensorrt-repository)
  - [Step 3. Install required dependencies](#step-3-install-required-dependencies)
  - [Step 4. Run Flux.1 Dev model inference](#step-4-run-flux1-dev-model-inference)
  - [Step 5. Run Flux.1 Schnell model inference](#step-5-run-flux1-schnell-model-inference)
  - [Step 6. Run SDXL model inference](#step-6-run-sdxl-model-inference)
  - [Step 7. Validate inference outputs](#step-7-validate-inference-outputs)
  - [Step 8. Cleanup and rollback](#step-8-cleanup-and-rollback)
  - [Step 9. Next steps](#step-9-next-steps)
- [Troubleshooting](#troubleshooting)
---
## Overview
## What you'll accomplish
You'll deploy GPU-accelerated multi-modal inference on NVIDIA Spark using TensorRT to run
Flux.1 and SDXL diffusion models with optimized performance across multiple precision formats
(BF16, FP16, FP8, and FP4).
## What to know before starting
- Working with Docker containers and GPU passthrough
- Using TensorRT for model optimization
- Hugging Face model hub authentication and downloads
- Command-line tools for GPU workloads
- Basic understanding of diffusion models and image generation
## Prerequisites
- NVIDIA Spark device with Blackwell GPU architecture
- Docker installed and accessible to current user
- NVIDIA Container Runtime configured
- Hugging Face account with valid token
- At least 48GB VRAM available for FP16 Flux.1 Schnell operations
- Verify GPU access: `nvidia-smi`
- Check Docker GPU integration: `docker run --rm --gpus all nvidia/cuda:12.0-base-ubuntu20.04 nvidia-smi`
- Confirm your Hugging Face token is set and grants access to the FLUX repos: `echo $HF_TOKEN`. If you don't have one yet, sign in to your Hugging Face account and create a token at https://huggingface.co/settings/tokens, making sure to grant it the required permissions. When creating the token, add access to the gated repos black-forest-labs/FLUX.1-dev and black-forest-labs/FLUX.1-dev-onnx (search for these repos in the token creation form). A quick sanity check is shown below.
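
As a quick sanity check, you can ask the Hugging Face API which account the token belongs to (this assumes `curl` is available; the `whoami-v2` endpoint returns account details for a valid token):
```bash
# Should print your account name; an error body means the token is invalid
curl -s -H "Authorization: Bearer $HF_TOKEN" https://huggingface.co/api/whoami-v2
```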
## Ancillary files
All necessary files can be found in the [TensorRT repository on GitHub](https://github.com/NVIDIA/TensorRT):
- **requirements.txt** - Python dependencies for TensorRT demo environment
- **demo_txt2img_flux.py** - Flux.1 model inference script
- **demo_txt2img_xl.py** - SDXL model inference script
- **TensorRT repository** - Contains diffusion demo code and optimization tools
## Time & risk
**Duration**: 45-90 minutes depending on model downloads and optimization steps
**Risks**: Large model downloads may time out; high VRAM requirements may cause OOM errors;
quantized models may show some quality degradation
**Rollback**: Remove downloaded models from HuggingFace cache, exit container environment
## Instructions
## Step 1. Launch the TensorRT container environment
Start the NVIDIA PyTorch container with GPU access and HuggingFace cache mounting. This provides
the TensorRT development environment with all required dependencies pre-installed.
```bash
docker run --gpus all --ipc=host --ulimit memlock=-1 \
--ulimit stack=67108864 -it --rm \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
nvcr.io/nvidia/pytorch:25.09-py3
```
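The demo commands in the following steps read `$HF_TOKEN` inside the container. If the token is already exported on the host, you can forward it by adding `-e HF_TOKEN` to the `docker run` command above; otherwise, set it manually once the container starts:
```bash
# Inside the container: make the token available to the demo scripts
export HF_TOKEN=<YOUR_TOKEN>
```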
## Step 2. Clone and set up TensorRT repository
Download the TensorRT repository and configure the environment for diffusion model demos.
```bash
git clone https://github.com/NVIDIA/TensorRT.git -b main --single-branch && cd TensorRT
export TRT_OSSPATH=/workspace/TensorRT/
cd $TRT_OSSPATH/demo/Diffusion
```
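Before continuing, you can confirm the demo scripts listed under [Ancillary files](#ancillary-files) are present in the working directory:
```bash
# Both inference scripts and the requirements file should be listed
ls demo_txt2img_flux.py demo_txt2img_xl.py requirements.txt
```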
## Step 3. Install required dependencies
Install NVIDIA ModelOpt and other dependencies for model quantization and optimization.
```bash
# Install OpenGL libraries required by the demo's image utilities
apt update
apt install -y libgl1 libglu1-mesa libglib2.0-0t64 libxrender1 libxext6 libx11-6 libxrandr2 libxss1 libxcomposite1 libxdamage1 libxfixes3 libxcb1
pip install nvidia-modelopt[torch,onnx]
# Drop the pinned nvidia-modelopt entry from requirements.txt so pip keeps
# the version installed above
sed -i '/^nvidia-modelopt\[.*\]=.*/d' requirements.txt
pip3 install -r requirements.txt
```
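To verify the quantization toolkit installed cleanly before moving on (the pip package `nvidia-modelopt` imports as `modelopt`):
```bash
# A clean import confirms ModelOpt is ready for the FP8/FP4 steps below
python3 -c "import modelopt; print('ModelOpt OK')"
```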
## Step 4. Run Flux.1 Dev model inference
Test multi-modal inference using the Flux.1 Dev model with different precision formats.
### Substep A. BF16 precision
```bash
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
--hf-token=$HF_TOKEN --download-onnx-models --bf16
```
### Substep B. FP8 quantized precision
```bash
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
--hf-token=$HF_TOKEN --quantization-level 4 --fp8 --download-onnx-models
```
### Substep C. FP4 quantized precision
```bash
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
--hf-token=$HF_TOKEN --fp4 --download-onnx-models
```
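The exact flags accepted by the demo (output directory, image dimensions, denoising steps, and so on) can vary between TensorRT releases, so treat the script's built-in help as the authoritative reference:
```bash
# List all supported options for the Flux demo script
python3 demo_txt2img_flux.py --help
```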
## Step 5. Run Flux.1 Schnell model inference
Test the faster Flux.1 Schnell variant with different precision formats.
> **Warning**: FP16 Flux.1 Schnell requires >48GB VRAM for native export
### Substep A. FP16 precision (high VRAM requirement)
```bash
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
--hf-token=$HF_TOKEN --version="flux.1-schnell"
```
### Substep B. FP8 quantized precision
```bash
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
--hf-token=$HF_TOKEN --version="flux.1-schnell" \
--quantization-level 4 --fp8 --download-onnx-models
```
### Substep C. FP4 quantized precision
```bash
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
--hf-token=$HF_TOKEN --version="flux.1-schnell" \
--fp4 --download-onnx-models
```
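To compare the precision variants quantitatively, wrap any of the commands above in `time`. The first run includes ONNX downloads and TensorRT engine builds, so compare second runs, which reuse cached engines:
```bash
# Second runs reuse cached engines, giving a fairer precision comparison
time python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
    --hf-token=$HF_TOKEN --version="flux.1-schnell" --fp4 --download-onnx-models
```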
## Step 6. Run SDXL model inference
Test the SDXL model for comparison with different precision formats.
### Substep A. BF16 precision
```bash
python3 demo_txt2img_xl.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
--hf-token=$HF_TOKEN --version xl-1.0 --download-onnx-models
```
### Substep B. FP8 quantized precision
```bash
python3 demo_txt2img_xl.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
--hf-token=$HF_TOKEN --version xl-1.0 --download-onnx-models --fp8
```
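To see how much memory each precision variant actually consumes, keep a GPU monitor running in a second terminal while the demos execute:
```bash
# Refreshes GPU utilization and memory once per second; Ctrl+C to stop
watch -n 1 nvidia-smi
```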
## Step 7. Validate inference outputs
Check that the models generated images successfully and measure performance differences.
```bash
# Check for generated images in the output directory
ls -la *.png *.jpg 2>/dev/null || echo "No image files found"
# Verify CUDA is accessible
nvidia-smi
# Check TensorRT version
python3 -c "import tensorrt as trt; print(f'TensorRT version: {trt.__version__}')"
```
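If you want to check more than file presence, a short inline script can confirm the images decode correctly and report their dimensions (this assumes Pillow is available in the container, which the demo requirements install):
```bash
python3 - <<'EOF'
# Open the first generated PNG and report its path, size, and color mode
import glob
from PIL import Image

files = sorted(glob.glob("**/*.png", recursive=True))
if files:
    img = Image.open(files[0])
    print(files[0], img.size, img.mode)
else:
    print("No PNG files found")
EOF
```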
## Step 8. Cleanup and rollback
Remove downloaded models and exit container environment to free disk space.
> **Warning**: This will delete all cached models and generated images
```bash
# Exit the container
exit
# Remove the HuggingFace cache (optional)
rm -rf $HOME/.cache/huggingface/
```
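If you'd rather free space selectively, `du` shows how the cache breaks down per model before you remove anything:
```bash
# Show per-model disk usage inside the Hugging Face cache
du -sh $HOME/.cache/huggingface/hub/* 2>/dev/null
```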
## Step 9. Next steps
Use the validated setup to generate custom images or integrate multi-modal inference into your
applications. Try different prompts or explore model fine-tuning with the established TensorRT
environment.
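For example, a small shell loop can batch several prompts through the same pipeline (shown here with the FP4 Flux.1 Dev configuration from Step 4; the prompts are just illustrations):
```bash
# Generate one image per prompt; cached engines make repeat runs fast
for prompt in "a watercolor of Kyoto in autumn" \
              "a neon-lit street on a rainy night"; do
    python3 demo_txt2img_flux.py "$prompt" --hf-token=$HF_TOKEN --fp4 --download-onnx-models
done
```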
## Troubleshooting
| Symptom | Cause | Fix |
|---------|-------|-----|
| "CUDA out of memory" error | Insufficient VRAM for model | Use FP8/FP4 quantization or smaller model |
| "Invalid HF token" error | Missing or expired HuggingFace token | Set valid token: `export HF_TOKEN=<YOUR_TOKEN>` |
| Model download timeouts | Network issues or rate limiting | Retry command or pre-download models |
> **Note:** DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
> With many applications still updating to take advantage of UMA, you may encounter memory issues even when within
> the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
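You can confirm the flush actually reclaimed memory by comparing `free -h` output before and after running the command:
```bash
# The buff/cache column should drop noticeably after the flush
free -h
```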