dgx-spark-playbooks/nvidia/multi-modal-inference/README.md
2025-10-13 14:55:45 +00:00


Multi-modal Inference

Setup multi-modal inference with TensorRT

Overview

Basic idea

Multi-modal inference combines different data types, such as text, images, and audio, within a single model pipeline to generate or interpret richer outputs.
Instead of processing one input type at a time, multi-modal systems build shared representations that enable tasks such as text-to-image generation, image captioning, and vision-language reasoning.

On GPUs, these modalities can be processed in parallel, yielding faster, higher-fidelity results for tasks that combine language and vision.

What you'll accomplish

You'll deploy GPU-accelerated multi-modal inference capabilities on NVIDIA Spark using TensorRT to run Flux.1 and SDXL diffusion models with optimized performance across multiple precision formats (FP16, FP8, FP4).

What to know before starting

  • Working with Docker containers and GPU passthrough
  • Using TensorRT for model optimization
  • Hugging Face model hub authentication and downloads
  • Command-line tools for GPU workloads
  • Basic understanding of diffusion models and image generation

Prerequisites

  • NVIDIA Spark device with Blackwell GPU architecture
  • Docker installed and accessible to current user
  • NVIDIA Container Runtime configured
  • Hugging Face account with access to the gated Black Forest Labs repositories FLUX.1-dev and FLUX.1-dev-onnx
  • Hugging Face token configured with access to both FLUX.1 model repositories
  • At least 48GB VRAM available for FP16 Flux.1 Schnell operations
  • Verify GPU access: nvidia-smi
  • Check Docker GPU integration: docker run --rm --gpus all nvidia/cuda:12.0-base-ubuntu20.04 nvidia-smi
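The prerequisite checks above can be bundled into a single script. This is a minimal sketch, not part of the playbook; it only verifies that the required commands are on the PATH and that HF_TOKEN is set, without confirming GPU access itself:

```python
import os
import shutil

def check_prereqs():
    """Print PASS/FAIL for each host-side prerequisite and return the results."""
    results = {
        "docker": shutil.which("docker") is not None,
        "nvidia-smi": shutil.which("nvidia-smi") is not None,
        "HF_TOKEN": bool(os.environ.get("HF_TOKEN")),
    }
    for name, ok in results.items():
        print(f"{'PASS' if ok else 'FAIL'}: {name}")
    return results

check_prereqs()
```

If any line reports FAIL, resolve it before launching the container; the FLUX.1 downloads in particular will fail without a valid HF_TOKEN.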

Ancillary files

All necessary files can be found in the NVIDIA TensorRT repository on GitHub: https://github.com/NVIDIA/TensorRT

Time & risk

  • Duration: 45-90 minutes depending on model downloads and optimization steps

  • Risks:

    • Large model downloads may time out
    • High VRAM requirements may cause OOM errors
    • Quantized models may show quality degradation
  • Rollback:

    • Remove downloaded models from the Hugging Face cache
    • Exit the container environment

Instructions

Step 1. Launch the TensorRT container environment

Start the NVIDIA PyTorch container with GPU access and HuggingFace cache mounting. This provides the TensorRT development environment with all required dependencies pre-installed.

docker run --gpus all --ipc=host --ulimit memlock=-1 \
--ulimit stack=67108864 -it --rm \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
nvcr.io/nvidia/pytorch:25.09-py3

Step 2. Clone and set up TensorRT repository

Download the TensorRT repository and configure the environment for diffusion model demos.

git clone https://github.com/NVIDIA/TensorRT.git -b main --single-branch && cd TensorRT
export TRT_OSSPATH=/workspace/TensorRT/
cd $TRT_OSSPATH/demo/Diffusion

Step 3. Install required dependencies

Install NVIDIA ModelOpt and other dependencies for model quantization and optimization.

## Install OpenGL libraries
apt update
apt install -y libgl1 libglu1-mesa libglib2.0-0t64 libxrender1 libxext6 libx11-6 libxrandr2 libxss1 libxcomposite1 libxdamage1 libxfixes3 libxcb1

pip install nvidia-modelopt[torch,onnx]

## Drop the pinned nvidia-modelopt entry so pip keeps the version installed above
sed -i '/^nvidia-modelopt\[.*\]=.*/d' requirements.txt
pip3 install -r requirements.txt
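The sed line above removes any pinned nvidia-modelopt entry from requirements.txt so that the freshly installed version is not overwritten. A hypothetical Python equivalent of that filter, shown only to make the intent explicit:

```python
import re

# Matches pinned entries such as "nvidia-modelopt[torch,onnx]==0.19.0"
PIN_PATTERN = re.compile(r"^nvidia-modelopt\[.*\]=")

def strip_pinned_modelopt(requirements_text):
    """Return requirements text with any pinned nvidia-modelopt line removed."""
    return "\n".join(
        line for line in requirements_text.splitlines()
        if not PIN_PATTERN.match(line)
    )
```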

Step 4. Run Flux.1 Dev model inference

Test multi-modal inference using the Flux.1 Dev model with different precision formats.

Substep A. BF16 quantized precision

python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
  --hf-token=$HF_TOKEN --download-onnx-models --bf16

Substep B. FP8 quantized precision

python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
  --hf-token=$HF_TOKEN --quantization-level 4 --fp8 --download-onnx-models

Substep C. FP4 quantized precision

python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
  --hf-token=$HF_TOKEN --fp4 --download-onnx-models

Step 5. Run Flux.1 Schnell model inference

Test the faster Flux.1 Schnell variant with different precision formats.

Warning

FP16 Flux.1 Schnell requires >48GB VRAM for native export
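Before attempting the FP16 export, you can check how much GPU memory is free. A minimal sketch using nvidia-smi's standard CSV query interface; the 48 GB threshold mirrors the warning above:

```python
import subprocess

def free_vram_mib(output=None):
    """Parse free GPU memory in MiB from nvidia-smi's CSV query output.

    Pass `output` directly for testing; otherwise nvidia-smi is invoked.
    """
    if output is None:
        output = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.free",
             "--format=csv,noheader,nounits"],
            text=True,
        )
    return [int(line) for line in output.strip().splitlines()]

try:
    for i, mib in enumerate(free_vram_mib()):
        status = "ok for FP16" if mib > 48 * 1024 else "may OOM at FP16"
        print(f"GPU {i}: {mib} MiB free ({status})")
except FileNotFoundError:
    print("nvidia-smi not found; run this on the DGX Spark host")
```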

Substep A. FP16 precision (high VRAM requirement)

python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
  --hf-token=$HF_TOKEN --version="flux.1-schnell"

Substep B. FP8 quantized precision

python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
  --hf-token=$HF_TOKEN --version="flux.1-schnell" \
  --quantization-level 4 --fp8 --download-onnx-models

Substep C. FP4 quantized precision

python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
  --hf-token=$HF_TOKEN --version="flux.1-schnell" \
  --fp4 --download-onnx-models
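To compare the precision variants quantitatively, you can time each run. This wrapper is a hypothetical addition, not part of the TensorRT demo; only the script name and flags are taken from the steps above:

```python
import subprocess
import time

def time_command(cmd):
    """Run a command and return (elapsed_seconds, return_code)."""
    start = time.perf_counter()
    proc = subprocess.run(cmd)
    return time.perf_counter() - start, proc.returncode

# Precision flags as used in the quantized substeps above.
PRECISION_FLAGS = {
    "fp8": ["--quantization-level", "4", "--fp8", "--download-onnx-models"],
    "fp4": ["--fp4", "--download-onnx-models"],
}
```

A sweep would then call, for each entry, something like time_command(["python3", "demo_txt2img_flux.py", prompt, f"--hf-token={token}", "--version=flux.1-schnell", *flags]) and compare the elapsed times.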

Step 6. Run SDXL model inference

Test the SDXL model for comparison with different precision formats.

Substep A. BF16 precision

python3 demo_txt2img_xl.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
  --hf-token=$HF_TOKEN --version xl-1.0 --download-onnx-models

Substep B. FP8 quantized precision

python3 demo_txt2img_xl.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
  --hf-token=$HF_TOKEN --version xl-1.0 --download-onnx-models --fp8

Step 7. Validate inference outputs

Check that the models generated images successfully and measure performance differences.

## Check for generated images in output directory
ls -la *.png *.jpg 2>/dev/null || echo "No image files found"

## Verify CUDA is accessible
nvidia-smi

## Check TensorRT version
python3 -c "import tensorrt as trt; print(f'TensorRT version: {trt.__version__}')"
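Beyond listing files, you can confirm the outputs are well-formed images. A small stdlib-only sketch (an assumption of this guide, not part of the demo) that checks each .png for the standard PNG signature:

```python
from pathlib import Path

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # standard 8-byte PNG file signature

def valid_pngs(directory="."):
    """Return names of .png files in `directory` with a valid PNG signature."""
    good = []
    for path in sorted(Path(directory).glob("*.png")):
        with open(path, "rb") as f:
            if f.read(8) == PNG_MAGIC:
                good.append(path.name)
    return good

print(valid_pngs() or "No valid PNG outputs found")
```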

Step 8. Cleanup and rollback

Remove downloaded models and exit container environment to free disk space.

Warning

This will delete all cached models and generated images

## Exit container
exit

## Remove HuggingFace cache (optional)
rm -rf $HOME/.cache/huggingface/

Step 9. Next steps

Use the validated setup to generate custom images or integrate multi-modal inference into your applications. Try different prompts or explore model fine-tuning with the established TensorRT environment.

Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| "CUDA out of memory" error | Insufficient VRAM for the model | Use FP8/FP4 quantization or a smaller model |
| "Invalid HF token" error | Missing or expired Hugging Face token | Set a valid token: export HF_TOKEN=<YOUR_TOKEN> |
| Cannot access gated repo | The FLUX.1 repositories are gated on Hugging Face | Request access to the gated model in your browser, then regenerate your Hugging Face token |
| Model download timeouts | Network issues or rate limiting | Retry the command or pre-download the models |

Note

DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. With many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
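To see the effect of flushing the cache, you can read the available memory in the shared UMA pool before and after. A minimal sketch reading the standard MemAvailable field from /proc/meminfo:

```python
def mem_available_gib(meminfo_text=None):
    """Return MemAvailable from /proc/meminfo in GiB.

    Pass `meminfo_text` for testing; otherwise /proc/meminfo is read.
    On DGX Spark's unified memory, this pool is shared with the GPU.
    """
    if meminfo_text is None:
        with open("/proc/meminfo") as f:
            meminfo_text = f.read()
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) / (1024 * 1024)  # kB -> GiB
    raise ValueError("MemAvailable not found in /proc/meminfo")
```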