dgx-spark-playbooks/nvidia/station-comfyui/README.md
2026-05-27 16:00:20 +00:00

26 KiB
Raw Blame History

Image & Video Generation with ComfyUI

Generate images and videos with FLUX, Wan 2.1, HunyuanVideo, and Cosmos on DGX Station

Table of Contents


Overview

Basic idea

ComfyUI is a node-based visual interface for building image and video generation workflows using diffusion models. Instead of a single text box, you connect processing nodes — model loaders, text encoders, samplers, decoders — into a graph that gives full control over every generation step.

  • Node-based workflows let you build, modify, and share complex generation pipelines visually.
  • Multi-model support covers the latest architectures: FLUX for images, Wan 2.1 and HunyuanVideo for video, and NVIDIA Cosmos for world generation.
  • Full precision on GB300 — with 252 GB of HBM3e, you can run 1217B image models and 1314B video models at bf16 with no quantization or offloading, which is impossible on consumer hardware.

What you'll accomplish

Deploy ComfyUI on DGX Station and run image and video generation workflows using six state-of-the-art models:

  • FLUX.1 [dev] (12B) — high-quality text-to-image generation
  • HiDream-I1 Full (17B) — the largest open image model, with four text encoders including Llama-3.1-8B
  • Wan 2.1 T2V/I2V 14B — text-to-video and image-to-video at 720p
  • HunyuanVideo (13B) — 1080p video generation leveraging the full GB300 memory (~100120 GB VRAM)
  • NVIDIA Cosmos-Predict2 (14B) — NVIDIA's world foundation model for video-to-world generation

You will also learn advanced techniques including ControlNet-guided generation and combined image-to-video pipelines.

What to know before starting

  • Basic Docker container usage
  • Familiarity with generative AI concepts (prompts, diffusion models) is helpful but not required

Prerequisites

  • NVIDIA DGX Station with GB300 GPU
  • Docker installed: docker --version
  • NVIDIA Container Toolkit configured: nvidia-smi should show the GB300
  • HuggingFace account with access token: https://huggingface.co/settings/tokens
  • At least 200 GB free disk space for model weights
  • Network access to HuggingFace and GitHub

Ancillary files

All required assets can be found in the ComfyUI playbook repository.

  • assets/Dockerfile — Builds the ComfyUI container image from NGC PyTorch base (ARM64)
  • assets/scripts/download-models.sh — Downloads all model weights from Hugging Face using the hf CLI (huggingface-hub package)
  • assets/workflows/*.json — Eight UI workflows (ComfyUI 0.4 graph with nodes / links) for Load in the web UI
  • assets/workflow_api/*.api.json — The same eight graphs in API format for /prompt and automation (curl, scripts)
  • assets/scripts/api_to_ui_workflow.py — Regenerates UI JSON from API JSON if you edit a graph programmatically

Time & risk

  • Duration: 45 minutes (excluding model downloads, which may take 3060 minutes depending on network speed)
  • Risks:
    • Model downloads require HuggingFace authentication and substantial bandwidth (~150 GB total)
    • Port 8188 must be accessible for the ComfyUI web interface
  • Rollback: Stop and remove the Docker container. Delete the models/ directory to reclaim disk space.
  • Last Updated: 05/26/2026
    • First Publication

Instructions

Step 1. Verify your environment

Confirm Docker, GPU access, and available disk space.

docker --version
nvidia-smi
df -h /
  • Docker: Must be running (version 24+ recommended).
  • nvidia-smi: Should list the GB300 GPU with 252 GB HBM3e.
  • Disk space: At least 200 GB free on / for model weights and the Docker image. On DGX Station /home is on the root filesystem, so checking / covers both. You can download fewer models by choosing a tier (see Step 4).

If you haven't already, add your user to the docker group:

sudo usermod -aG docker $USER
newgrp docker

Step 2. Set up environment variables

Set your HuggingFace token so the download script and container can access gated models.

## HuggingFace token (required). Run this in the SAME shell that will
## launch `bash assets/scripts/download-models.sh` in Step 4 — the script
## reads $HF_TOKEN from the environment and exits early if it is unset.
## Get a token from https://huggingface.co/settings/tokens
export HF_TOKEN="your_huggingface_token"

Some models (FLUX.1, HiDream-I1) require accepting the model license on HuggingFace before downloading. Visit each model page and click "Agree and access" if prompted:

Step 3. Clone the playbook and build the container

Clone the playbook repository and build the ComfyUI Docker image. The image is built on top of the NGC PyTorch container, which is already optimized for the GB300's ARM64 architecture.

git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-spark-playbooks/nvidia/station-comfyui

Build the container image:

docker build -t comfyui-gb300 -f assets/Dockerfile .

The build clones ComfyUI, installs dependencies (preserving the NGC-optimized PyTorch), and pre-installs custom nodes for video generation, ControlNet, and IP-Adapter. This takes approximately 510 minutes.

Step 4. Download models

This playbook uses models organized into three tiers. Download only what you need, or download everything.

Tier Models Disk space Peak VRAM (approx.) Workflows enabled
1 — Getting Started FLUX.1 dev, Wan 2.1 T2V 14B ~70 GB ~80 GB (Wan 720p clip) Text-to-image, text-to-video
2 — Intermediate + HiDream-I1, Wan 2.1 I2V, Cosmos-Predict2 ~180 GB ~100 GB (FLUX→Wan two-model graph) + HiDream image gen, image-to-video, FLUX→Wan pipeline, Cosmos Video2World
3 — Advanced + HunyuanVideo, FLUX ControlNet (Canny) ~230 GB ~120 GB (Hunyuan 1080p / long clips) + 1080p video, ControlNet-guided generation

Peak VRAM depends on resolution, frame count, and precision; values above are order-of-magnitude for the default graphs in this playbook on a GB300 (252 GB HBM3e).

Install the Hugging Face Hub CLI (provides the hf command) if you do not already have it. The CLI installs to ~/.local/bin/, which is not on the default non-interactive PATH, so add it before continuing:

pip3 install --break-system-packages huggingface-hub
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
export PATH="$HOME/.local/bin:$PATH"
hf --version   # confirms PATH is correct

Run the download script with your chosen tier (default downloads all):

## Download Tier 1 only (Getting Started):
bash assets/scripts/download-models.sh 1

## Download all tiers:
bash assets/scripts/download-models.sh

Model downloads can take 3060 minutes depending on network speed. The script uses the Hugging Face Hub hf download command (from the huggingface_hub package). If a download fails, the script exits with an error and prints which file was expected — check your token, network, and that you have accepted gated model licenses on Hugging Face.

After Tier 1 completes, verify weights landed under models/:

ls -la ./models/diffusion_models/
ls -la ./models/text_encoders/ | head

Step 5. Launch ComfyUI

Start the ComfyUI container with all model and output directories mounted as volumes. On DGX Station, identify the GB300 GPU index with nvidia-smi and use --gpus '"device=N"' to target it. If the GB300 is your only GPU, --gpus all also works.

## Find the GB300 device index (look for "GB300" in the Name column)
nvidia-smi --query-gpu=index,name --format=csv,noheader

The default --gpus '"device=0"' works on single-GPU stations where the GB300 is index 0. If nvidia-smi reports the GB300 at a different index (for example index 1 on dual-GPU stations with an RTX PRO 6000 + GB300), substitute that index in the command below.

## device=0 by default; replace with the GB300 index from the command above
docker run -d \
  --name comfyui \
  --gpus '"device=0"' \
  --ipc host \
  --ulimit memlock=-1 \
  -p 8188:8188 \
  -v "$(pwd)/models:/opt/ComfyUI/models" \
  -v "$(pwd)/output:/opt/ComfyUI/output" \
  -v "$(pwd)/input:/opt/ComfyUI/input" \
  -v "$(pwd)/assets/workflows:/opt/ComfyUI/user/default/workflows" \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  comfyui-gb300

Check startup logs:

docker logs -f comfyui

Expected output includes custom node loading messages and:

To see the GUI go to: http://0.0.0.0:8188

Press Ctrl+C to exit the log view. Open a web browser and navigate to http://<STATION_IP>:8188 where <STATION_IP> is your DGX Station's IP address.

Note

The startup logs include several benign warnings you can ignore: aimdo: ... funchook_prepare(cuMemFree_v2) failed (NGC PyTorch's CUDA hooks tool falling back to no-op), urllib3 / charset_normalizer doesn't match a supported version, torchaudio missing (covered by the import-only stub — no playbook workflow uses audio VAE), DWPose: Onnxruntime not found ... switch to OpenCV with CPU device (aarch64 has no onnxruntime-gpu wheel; CPU preprocessing still works), and accelerate / GPTQModel / optimum / bitsandbytes not installed from the HiDream sampler. The real "ready" signal is the To see the GUI go to: ... line above; treat anything else as suspect.

UI workflows vs API graphs (important)

ComfyUI uses two different JSON shapes:

Location Format Use
assets/workflows/*.json mounted at user/default/workflows/ UI workflow (has "nodes" and "links") Load in the web UI, edit in the canvas, then Queue Prompt
assets/workflow_api/*.api.json (on the host repo, not mounted into the default workflow folder) API prompt graph (flat node ids → class_type / inputs) POST /prompt, curl, automation

If you open an .api.json file with Load, the UI shows "Error: the workflow does not contain any nodes" — that is expected; those files are not UI workflows.

Optional — run the same graph via HTTP API (from the playbook root, with ComfyUI listening on port 8188). Strip any non-node keys (for example _comment in some API files), minify to one line, and POST:

PROMPT=$(python3 -c "import json; d=json.load(open('assets/workflow_api/flux-text-to-image.api.json')); print(json.dumps({k:v for k,v in d.items() if str(k).isdigit()}, separators=(',',':')))")
curl -sS http://127.0.0.1:8188/prompt \
  -X POST \
  -H "Content-Type: application/json" \
  -d "{\"prompt\":${PROMPT}}" | python3 -m json.tool

The response includes a prompt_id you can correlate with server logs and the output/ folder.

ComfyUI interface orientation:

  • Canvas — The central area where you build and view node workflows.
  • Queue Prompt — The button (top right) that runs the current workflow.
  • Load — Load a UI workflow from flux-text-to-image.json, wan-text-to-video.json, etc. (listed in the workflow sidebar under the mounted folder).
  • Manager — Access ComfyUI-Manager to install additional custom nodes.

Step 6. Image generation with FLUX.1 dev

Requires: Tier 1 models

Load the pre-built FLUX text-to-image workflow. In ComfyUI, click Load and select flux-text-to-image.json (UI format). Do not use the *.api.json files in assets/workflow_api/ with Load — they are for the HTTP API only.

What this workflow does:

The workflow connects these nodes in sequence:

  1. UNETLoader — Loads the FLUX.1 dev 12B transformer (~24 GB in bf16) with weight_dtype=default.
  2. DualCLIPLoader — Loads CLIP-L and T5-XXL text encoders that convert your prompt into conditioning vectors.
  3. CLIP Text Encode — Takes your text prompt and produces positive conditioning.
  4. FluxGuidance — Applies FLUX's guidance value (default 3.5) to the conditioning.
  5. EmptySD3LatentImage — Creates a blank latent at your chosen resolution (default: 1024x1024).
  6. ModelSamplingFlux + BasicScheduler + KSamplerSelect + BasicGuider + RandomNoise — Configure FLUX's flow-matching schedule (20 steps, euler/simple).
  7. SamplerCustomAdvanced — The diffusion sampling loop that denoises the latent.
  8. VAE Decode — Converts the latent back into a pixel image.
  9. Save Image — Writes the result to the output/ directory.

Try it:

  1. Find the CLIP Text Encode node and enter a prompt, for example: A majestic snow leopard resting on a cliff at golden hour, photorealistic, 8k detail
  2. Click Queue Prompt.
  3. The image generates in approximately 1530 seconds. Results appear in the output/ directory and in the preview node.

Experiment with different prompts, resolutions (512x512 up to 2048x2048), and step counts. FLUX.1 dev produces high-quality results even at 20 steps.

Step 7. Video generation with Wan 2.1

Requires: Tier 1 models

Load wan-text-to-video.json from the workflow browser.

What this workflow does:

  1. Load Diffusion Model — Loads the Wan 2.1 T2V 14B model (~28 GB in bf16).
  2. CLIPLoader — Loads the UMT5-XXL text encoder for Wan.
  3. CLIP Text Encode — Encodes your video description prompt.
  4. EmptyHunyuanLatentVideo — Creates a blank video latent (default: 720p, 81 frames at ~16 fps ≈ 5 seconds). Wan reuses this latent format.
  5. KSampler — Diffusion sampling over the video latent. This is the slowest step — expect 35 minutes for a 5-second clip on the GB300.
  6. VAE Decode — Converts latents to video frames.
  7. SaveAnimatedWEBP — Encodes frames into an animated WEBP file.

Try it:

  1. Enter a prompt: A drone shot flying over a misty mountain forest at sunrise, cinematic
  2. Click Queue Prompt.
  3. Generation takes 310 minutes at 720p with 81 frames. Monitor GPU memory with nvidia-smi in another terminal — the 14B model at 720p uses approximately 6580 GB of the GB300's 252 GB HBM3e.
  4. The output .webp (animated WEBP from SaveAnimatedWEBP) appears in the output/ directory. To convert to MP4, use ffmpeg -i output/wan_t2v_output_00001_.webp output/wan_t2v_output.mp4.

Tips:

  • Reduce frame count (e.g., 49 frames ≈ 3 seconds) for faster iteration.
  • Wan 2.1 responds well to cinematic, descriptive prompts with camera movement descriptions.

Step 8. Intermediate workflows

Requires: Tier 2 models

This step introduces four additional workflows. Each builds on the basics from Steps 67.

HiDream-I1 image generation

Load hidream-text-to-image.json.

HiDream-I1 Full is a 17B parameter image model that uses four text encoders — CLIP-L, CLIP-G, T5-XXL, and Llama-3.1-8B-Instruct. The Llama encoder gives it exceptional prompt understanding, especially for complex or nuanced descriptions.

The full pipeline uses approximately 6065 GB in bf16 — well within the GB300's capacity but impossible on most GPUs.

Try it: Use a detailed, complex prompt to see the difference from FLUX — for example: An astronaut riding a horse on Mars, with Earth visible in the sky, oil painting style by Rembrandt, dramatic chiaroscuro lighting

Wan 2.1 image-to-video

Load wan-image-to-video.json.

This workflow takes an input image and animates it into a video clip. Place your source image in the input/ directory before running.

  1. The LoadImage node reads from input/.
  2. The Wan 2.1 I2V 14B model generates motion that is consistent with the source image.

Try it: Generate an image with FLUX first (Step 6), copy it from output/ to input/, then animate it.

FLUX → Wan combined pipeline

Load flux-to-wan-pipeline.json.

This workflow chains two models in a single graph:

  1. FLUX.1 dev generates a high-quality still image from your text prompt.
  2. The image is passed directly to Wan 2.1 I2V 14B, which animates it into a video.

This avoids manually moving files between workflows. Both models load into GPU memory simultaneously (~95 GB total in bf16).

Cosmos-Predict2 Video2World

Load cosmos-video2world.json.

NVIDIA Cosmos-Predict2 14B is NVIDIA's world foundation model for Video2World generation. It takes an input image and generates a physically plausible video extending from that scene. Place your source image in the input/ directory before running.

The Cosmos VAE is extremely efficient — it can encode/decode 1280x704 at 121 frames without tiling.

Try it: Use an image from a previous FLUX generation as the start frame, with a prompt describing the motion: A red ball rolling down a wooden ramp and bouncing off a wall, physics simulation, realistic lighting

Step 9. Advanced workflows

Requires: Tier 3 models

HunyuanVideo 1080p generation

Load hunyuan-1080p-video.json.

This is the true GB300 showcase. HunyuanVideo's 13B model generating at 1080p resolution uses approximately 100120 GB of VRAM — impossible on any consumer or professional GPU, but well within the GB300's 252 GB.

  • Default: 1920x1056, 49 frames (~3 seconds). Note: height must be divisible by 16 for HunyuanVideo's latent space, so 1056 is used instead of 1080.
  • Generation time: 25 minutes for 49 frames, longer for more.
  • Monitor with nvidia-smi — you should see 100+ GB GPU memory usage.

Try it: A time-lapse of cherry blossoms falling in a Japanese garden with a koi pond, 4K cinematic

ControlNet with FLUX

Load flux-controlnet.json.

ControlNet lets you guide image generation with structural conditioning — edges, depth maps, or pose skeletons extracted from a reference image.

  1. Place a reference image in input/.
  2. The Canny Edge Preprocessor extracts edge structure from the reference.
  3. The FLUX.1 Canny Dev model (a full FLUX variant fine-tuned for canny conditioning) generates an image following that structure while applying the text prompt's style and content.
  4. Both the preprocessed canny image and the final output are saved for comparison.

Use cases: Architectural visualization, consistent character poses, style transfer while preserving composition.

Step 10. Cleanup

Stop and remove the ComfyUI container:

docker stop comfyui
docker rm comfyui

Note

Files in output/ and models/ are written by the container as root, so removing them from the host shell needs sudo (e.g. sudo rm -rf models/). To avoid this in future runs, add --user "$(id -u):$(id -g)" to the docker run command in Step 5 — note that this requires the host UID to have write access to all mounted directories.

Optionally remove the Docker image:

docker rmi comfyui-gb300

Optionally remove downloaded models to reclaim disk space:

rm -rf models/

Generated images and videos in output/ are preserved on the host regardless of container state.

Troubleshooting

Common issues

Symptom Cause Fix
"permission denied" when running docker User not in docker group Run sudo usermod -aG docker $USER && newgrp docker
Container fails to start with GPU error NVIDIA Container Toolkit not configured Run nvidia-ctk runtime configure --runtime=docker and restart Docker
ComfyUI web UI not accessible Firewall blocking port or wrong IP Verify with docker logs comfyui, check that port 8188 is open, use http://<STATION_IP>:8188
"Model file not found" when running workflow Model not downloaded or wrong path Verify models are in ./models/ and the volume mount is correct in the docker run command
HuggingFace download fails with 401 Invalid or missing HF token Verify HF_TOKEN is exported and valid at huggingface.co/settings/tokens
CUDA out of memory during video generation Frame count or resolution too high Reduce frame count or resolution. At 720p with Wan 2.1 14B, keep clips under 5 seconds initially
CUDA out of memory during 1080p HunyuanVideo Model + video tensors exceed GPU memory Use fewer frames (e.g., 49 instead of 97). HunyuanVideo at 1080p needs ~100-120 GB
Workflow loads but nodes show red "missing" Custom node not installed Use ComfyUI-Manager (click Manager → Install Missing Custom Nodes) or rebuild the Docker image
Video output is a black screen VAE decode issue or wrong model variant Ensure you are using the correct model variant (T2V vs I2V) and the VAE is loaded
Very slow generation, GPU utilization low PyTorch not using GPU or wrong CUDA version Run nvidia-smi inside container: docker exec comfyui nvidia-smi. Ensure GPU is visible
"No module named ..." error on startup Custom node dependency not installed Exec into container and install: docker exec comfyui pip install <module> then restart
Docker build fails on ARM64 with Could not find a version that satisfies the requirement onnxruntime-gpu onnxruntime-gpu has no aarch64 wheel on PyPI Already handled by the shipped Dockerfile, which sed-substitutes onnxruntime-gpuonnxruntime (CPU build) in every custom_node requirements.txt before pip install. If you see this error, you are building from a Dockerfile predating that fix — pull the latest assets and rebuild.
Docker build fails on ARM64 (other packages) Some custom-node dependencies have no aarch64 wheel Find the failing package in the build log. The custom-node install loop is wrapped in || true, so the build still completes but the affected node will be missing modules at runtime. Either skip the node (remove its directory from custom_nodes/ in the Dockerfile clone block) or install via ComfyUI-Manager after launch with a manually built wheel.
NGC image pull requires authentication NGC registry needs login Run docker login nvcr.io with your NGC API key
device >= 0 && device < num_gpus INTERNAL ASSERT FAILED on startup Using --gpus all on a multi-GPU system causes a PyTorch assertion Use --gpus '"device=N"' to target the GB300 specifically (check index with nvidia-smi)
No HiDream models available warning on startup HiDream custom node reports no models found This is a warning, not an error. It clears once HiDream model files are downloaded (Tier 2)
Web UI: "Error: the workflow does not contain any nodes" when using Load The file is API format (flat node_id → {class_type, inputs}), not a UI workflow In the playbook, use assets/workflows/<name>.json in the Load dialog (under user/default/workflows inside the container). For curl / HTTP API, use assets/workflow_api/<name>.api.json inside {"prompt": ...}.
huggingface-cli: command not found or download script errors Deprecated CLI name Install huggingface_hub and use hf download (the script does this automatically).
Download script exits but models/diffusion_models/ is empty Silent failure in older scripts or wrong token Re-run with bash -x assets/scripts/download-models.sh 1; confirm HF_TOKEN and license acceptance on Hugging Face. The script now fails fast if a file is missing after hf download.
Container exits on startup with ModuleNotFoundError: torchaudio Container was built from a Dockerfile predating the torchaudio shim Rebuild the image: docker build -t comfyui-gb300 -f assets/Dockerfile .. The shipped Dockerfile creates an import-only torchaudio stub (NGC PyTorch's custom NVFP4 ABI is incompatible with PyPI torchaudio wheels). Lightricks audio VAE workflows are not supported in this image; no other workflow needs torchaudio.
OSError: ... undefined symbol: torch_dtype_float4_e2m1fn_x2 from torchaudio Real torchaudio installed on top of NGC PyTorch Same fix as above — rebuild from the shipped Dockerfile. Do not pip install torchaudio manually inside the container.
DWPose: Onnxruntime not found or doesn't come with acceleration providers, switch to OpenCV with CPU device Expected on aarch64. PyPI has no onnxruntime-gpu wheel for arm64; the Dockerfile substitutes the CPU onnxruntime package Informational warning, not an error. DWPose preprocessing runs on CPU (slower than GPU) but produces correct output.
aimdo: ... funchook_prepare(cuMemFree_v2) failed: 8 Failed to allocate memory in unused regions at startup NGC PyTorch's CUDA-hooks diagnostic tool (aimdo) cannot install hooks under default container caps and falls back to no-op Benign. ComfyUI works normally; the message is informational from the NGC base image. No action required.
RequestsDependencyWarning: urllib3 (...) or charset_normalizer (...) doesn't match a supported version! at startup Version skew between requests and the NGC-pinned urllib3 / charset_normalizer wheels Benign. ComfyUI's HTTP traffic still works. Suppress with PYTHONWARNINGS=ignore::requests.RequestsDependencyWarning if it bothers you.

Note

ComfyUI logs are visible with docker logs -f comfyui. Most errors (missing models, node failures) are reported in these logs with clear messages.