Multi-modal Inference
Set up multi-modal inference with TensorRT
Overview
What you'll accomplish
You'll deploy GPU-accelerated multi-modal inference capabilities on NVIDIA Spark using TensorRT to run Flux.1 and SDXL diffusion models with optimized performance across multiple precision formats (FP16, FP8, FP4).
What to know before starting
- Working with Docker containers and GPU passthrough
- Using TensorRT for model optimization
- Hugging Face model hub authentication and downloads
- Command-line tools for GPU workloads
- Basic understanding of diffusion models and image generation
Prerequisites
- NVIDIA Spark device with Blackwell GPU architecture
- Docker installed and accessible to current user
- NVIDIA Container Runtime configured
- Hugging Face account with valid token
- At least 48GB VRAM available for FP16 Flux.1 Schnell operations
- Verify GPU access:
nvidia-smi
- Check Docker GPU integration:
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi
- Confirm your Hugging Face token is available and has access to the FLUX repos:
echo $HF_TOKEN
Sign in to your Hugging Face account and create a token at https://huggingface.co/settings/tokens. When creating the token, grant it read access to the black-forest-labs/FLUX.1-dev and black-forest-labs/FLUX.1-dev-onnx repositories (search for these repos in the token's permission settings).
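A minimal token pre-flight check can save a failed run later. This is a sketch; check_hf_token is our own helper name, not part of the TensorRT tooling, and it only verifies that the variable is non-empty (not that the token is valid):

```shell
# Hypothetical pre-flight check: is the HF_TOKEN variable populated?
check_hf_token() {
  if [ -z "${1:-}" ]; then
    echo "missing"
  else
    echo "ok"
  fi
}

# Prints "ok" if HF_TOKEN is set in this shell, "missing" otherwise
check_hf_token "${HF_TOKEN:-}"
```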
Ancillary files
All necessary files can be found in the TensorRT repository on GitHub (https://github.com/NVIDIA/TensorRT):
- requirements.txt - Python dependencies for TensorRT demo environment
- demo_txt2img_flux.py - Flux.1 model inference script
- demo_txt2img_xl.py - SDXL model inference script
- TensorRT repository - Contains diffusion demo code and optimization tools
Time & risk
Duration: 45-90 minutes depending on model downloads and optimization steps
Risks: Large model downloads may timeout; high VRAM requirements may cause OOM errors; quantized models may show quality degradation
Rollback: Remove downloaded models from HuggingFace cache, exit container environment
Instructions
Step 1. Launch the TensorRT container environment
Start the NVIDIA PyTorch container with GPU access and HuggingFace cache mounting. This provides the TensorRT development environment with all required dependencies pre-installed.
docker run --gpus all --ipc=host --ulimit memlock=-1 \
    --ulimit stack=67108864 -it --rm \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
nvcr.io/nvidia/pytorch:25.09-py3
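Before launching the container, a quick sanity check that the required tools are on the PATH can be sketched as follows; require is a hypothetical helper, not an NVIDIA utility:

```shell
# Report whether each required command-line tool is installed
require() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
  fi
}

require docker
require nvidia-smi
```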
Step 2. Clone and set up TensorRT repository
Download the TensorRT repository and configure the environment for diffusion model demos.
git clone https://github.com/NVIDIA/TensorRT.git -b main --single-branch && cd TensorRT
export TRT_OSSPATH=/workspace/TensorRT/
cd $TRT_OSSPATH/demo/Diffusion
Step 3. Install required dependencies
Install NVIDIA ModelOpt and other dependencies for model quantization and optimization.
## Install OpenGL libraries
apt update
apt install -y libgl1 libglu1-mesa libglib2.0-0t64 libxrender1 libxext6 libx11-6 libxrandr2 libxss1 libxcomposite1 libxdamage1 libxfixes3 libxcb1
pip install "nvidia-modelopt[torch,onnx]"
## Drop the pinned nvidia-modelopt entry from requirements.txt, since it was installed above
sed -i '/^nvidia-modelopt\[.*\]=.*/d' requirements.txt
pip3 install -r requirements.txt
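If you are unsure what the sed command removes, the following throwaway example reproduces its effect on a sample file (the pinned version 0.19.0 is a made-up placeholder, not the actual pin in requirements.txt):

```shell
# Build a two-line sample requirements file
rm -f /tmp/req_sample.txt
printf 'nvidia-modelopt[torch,onnx]==0.19.0\nonnx-graphsurgeon\n' > /tmp/req_sample.txt

# The same filter used above: delete any pinned nvidia-modelopt[...] line
sed -i '/^nvidia-modelopt\[.*\]=.*/d' /tmp/req_sample.txt

cat /tmp/req_sample.txt   # prints: onnx-graphsurgeon
```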
Step 4. Run Flux.1 Dev model inference
Test multi-modal inference using the Flux.1 Dev model with different precision formats.
Substep A. BF16 precision
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
--hf-token=$HF_TOKEN --download-onnx-models --bf16
Substep B. FP8 quantized precision
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
--hf-token=$HF_TOKEN --quantization-level 4 --fp8 --download-onnx-models
Substep C. FP4 quantized precision
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
--hf-token=$HF_TOKEN --fp4 --download-onnx-models
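The three invocations above differ only in their precision flags, so a small sweep helper can emit all of them in one go. This is a sketch; sweep_precisions is our own name, and echo prints each command instead of executing it (remove echo to run them for real):

```shell
# Print the demo command once per precision configuration
sweep_precisions() {
  for flags in "--bf16" "--quantization-level 4 --fp8" "--fp4"; do
    echo "python3 demo_txt2img_flux.py \"a beautiful photograph of Mt. Fuji during cherry blossom\" --hf-token=\$HF_TOKEN --download-onnx-models $flags"
  done
}

sweep_precisions
```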
Step 5. Run Flux.1 Schnell model inference
Test the faster Flux.1 Schnell variant with different precision formats.
Warning: FP16 Flux.1 Schnell requires more than 48 GB of VRAM for native export.
Substep A. FP16 precision (high VRAM requirement)
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
--hf-token=$HF_TOKEN --version="flux.1-schnell"
Substep B. FP8 quantized precision
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
--hf-token=$HF_TOKEN --version="flux.1-schnell" \
--quantization-level 4 --fp8 --download-onnx-models
Substep C. FP4 quantized precision
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
--hf-token=$HF_TOKEN --version="flux.1-schnell" \
--fp4 --download-onnx-models
Step 6. Run SDXL model inference
Test the SDXL model for comparison with different precision formats.
Substep A. BF16 precision
python3 demo_txt2img_xl.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
--hf-token=$HF_TOKEN --version xl-1.0 --download-onnx-models
Substep B. FP8 quantized precision
python3 demo_txt2img_xl.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
--hf-token=$HF_TOKEN --version xl-1.0 --download-onnx-models --fp8
Step 7. Validate inference outputs
Check that the models generated images successfully and measure performance differences.
## Check for generated images in output directory
ls -la *.png *.jpg 2>/dev/null || echo "No image files found"
## Verify CUDA is accessible
nvidia-smi
## Check TensorRT version
python3 -c "import tensorrt as trt; print(f'TensorRT version: {trt.__version__}')"
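To summarize outputs across several runs, a small helper can count generated images by extension. This is a minimal sketch (count_images is our own name) assuming the demo scripts write images into the current directory:

```shell
# Count .png and .jpg files directly inside a directory
count_images() {
  dir="${1:-.}"
  pngs=$(find "$dir" -maxdepth 1 -name '*.png' | wc -l | tr -d ' ')
  jpgs=$(find "$dir" -maxdepth 1 -name '*.jpg' | wc -l | tr -d ' ')
  echo "png=$pngs jpg=$jpgs"
}

# Demonstrate on a throwaway directory with three empty files
rm -rf /tmp/img_test && mkdir -p /tmp/img_test
touch /tmp/img_test/a.png /tmp/img_test/b.png /tmp/img_test/c.jpg
count_images /tmp/img_test   # prints: png=2 jpg=1
```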
Step 8. Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| "CUDA out of memory" error | Insufficient VRAM for model | Use FP8/FP4 quantization or smaller model |
| "Invalid HF token" error | Missing or expired HuggingFace token | Set valid token: export HF_TOKEN=<YOUR_TOKEN> |
| Model download timeouts | Network issues or rate limiting | Retry command or pre-download models |
Step 9. Cleanup and rollback
Remove downloaded models and exit container environment to free disk space.
Warning: This will delete all cached models and generated images.
## Exit container
exit
## Remove HuggingFace cache (optional)
rm -rf $HOME/.cache/huggingface/
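Before deleting, you may want to see how much disk the cache occupies; cache_size is a hypothetical one-liner wrapping du:

```shell
# Print the human-readable size of a directory (empty output if it doesn't exist)
cache_size() { du -sh "$1" 2>/dev/null | cut -f1; }

cache_size "$HOME/.cache/huggingface"
```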
Step 10. Next steps
Use the validated setup to generate custom images or integrate multi-modal inference into your applications. Try different prompts or explore model fine-tuning with the established TensorRT environment.