diff --git a/nvidia/station-comfyui/README.md b/nvidia/station-comfyui/README.md index 0cc2b64..f69e353 100644 --- a/nvidia/station-comfyui/README.md +++ b/nvidia/station-comfyui/README.md @@ -55,7 +55,7 @@ You will also learn advanced techniques including ControlNet-guided generation a ## Ancillary files -All required assets can be found [in the ComfyUI playbook repository](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/station-comfyui/). +All required assets can be found in the [ComfyUI playbook repository](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/station-comfyui/). - `assets/Dockerfile` — Builds the ComfyUI container image from NGC PyTorch base (ARM64) - `assets/scripts/download-models.sh` — Downloads all model weights from Hugging Face using the **`hf`** CLI (`huggingface-hub` package) @@ -70,9 +70,8 @@ All required assets can be found [in the ComfyUI playbook repository](https://gi * Model downloads require HuggingFace authentication and substantial bandwidth (~150 GB total) * Port 8188 must be accessible for the ComfyUI web interface * **Rollback:** Stop and remove the Docker container. Delete the `models/` directory to reclaim disk space. -* **Last Updated:** 05/07/2026 - * Re-validated end-to-end on GB300: clean image build (`comfyui-gb300`, ~24 GB), container starts and serves on port 8188, all 8 mounted UI workflows enumerate correctly, `/object_info` returns 1092 node types, `/prompt` validation rejects on missing-model with clean errors. Documented benign startup warnings (`aimdo` CUDA-hook fallback, `urllib3` / `charset_normalizer` version skew) so users do not chase non-issues. 
- * 05/06/2026 — first publication; fixed walkthrough issues found on GB300: torchaudio shim for NGC PyTorch ABI mismatch, aarch64 onnxruntime swap, model-filename collisions (HiDream VAE → `ae_hidream.safetensors`, HunyuanVideo CLIP → `clip_l_hunyuan.safetensors`), `--gpus device=0` default, `df -h /` prereq, `~/.local/bin` PATH guidance, FLUX node list aligned with the actual graph, `.webp` output (not MP4), HF token via env not CLI, container output `chown` cleanup hint. +* **Last Updated:** 05/26/2026 + * First Publication ## Instructions @@ -336,7 +335,7 @@ This avoids manually moving files between workflows. Both models load into GPU m ### Cosmos-Predict2 Video2World -Load `cosmos-text-to-video.json`. +Load `cosmos-video2world.json`. **NVIDIA Cosmos-Predict2 14B** is NVIDIA's world foundation model for Video2World generation. It takes an input image and generates a physically plausible video extending from that scene. Place your source image in the `input/` directory before running. 
diff --git a/nvidia/station-comfyui/endpoint-production.yaml b/nvidia/station-comfyui/endpoint-production.yaml new file mode 100644 index 0000000..f4d0cb4 --- /dev/null +++ b/nvidia/station-comfyui/endpoint-production.yaml @@ -0,0 +1,496 @@ +kind: Playbook +metadata: + name: station-comfyui + displayName: Image & Video Generation with ComfyUI + shortDescription: Generate images and videos with FLUX, Wan 2.1, HunyuanVideo, and Cosmos on DGX Station + + publisher: nvidia + description: | + # REPLACE THIS WITH YOUR MODEL CARD + https://gitlab-master.nvidia.com/api-catalog/examples/-/blob/main/modelcard-example-mixtral8x7b.md?ref_type=heads + + labelsV2: + - gpuType:playbook:gpu_type_station + - DGX Station + - GB300 + - Image Generation + - Video Generation + - ComfyUI + - FLUX + - Wan 2.1 + - HunyuanVideo + - Cosmos + - Docker + + attributes: + - key: DURATION + value: 45 MIN + +spec: + artifactName: station-comfyui + nvcfFunctionId: None + attributes: + + showUnavailableBanner: false + apiDocsUrl: None + termsOfUse: | + + cta: + text: View on GitHub + url: https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/station-comfyui/ + + + tabs: + - + id: overview + + label: Overview + content: | + # Basic idea + + ComfyUI is a node-based visual interface for building image and video generation workflows using diffusion models. Instead of a single text box, you connect processing nodes — model loaders, text encoders, samplers, decoders — into a graph that gives full control over every generation step. + + - **Node-based workflows** let you build, modify, and share complex generation pipelines visually. + - **Multi-model support** covers the latest architectures: FLUX for images, Wan 2.1 and HunyuanVideo for video, and NVIDIA Cosmos for world generation. + - **Full precision on GB300** — with 252 GB of HBM3e, you can run 12–17B image models and 13–14B video models at bf16 with no quantization or offloading, which is impossible on consumer hardware. 
+ + # What you'll accomplish + + Deploy ComfyUI on DGX Station and run image and video generation workflows using six state-of-the-art models: + + - **FLUX.1 [dev]** (12B) — high-quality text-to-image generation + - **HiDream-I1 Full** (17B) — the largest open image model, with four text encoders including Llama-3.1-8B + - **Wan 2.1 T2V/I2V 14B** — text-to-video and image-to-video at 720p + - **HunyuanVideo** (13B) — 1080p video generation leveraging the full GB300 memory (~100–120 GB VRAM) + - **NVIDIA Cosmos-Predict2** (14B) — NVIDIA's world foundation model for video-to-world generation + + You will also learn advanced techniques including ControlNet-guided generation and combined image-to-video pipelines. + + # What to know before starting + + - Basic Docker container usage + - Familiarity with generative AI concepts (prompts, diffusion models) is helpful but not required + + # Prerequisites + + - NVIDIA DGX Station with GB300 GPU + - Docker installed: `docker --version` + - NVIDIA Container Toolkit configured: `nvidia-smi` should show the GB300 + - HuggingFace account with access token: [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) + - At least 200 GB free disk space for model weights + - Network access to HuggingFace and GitHub + + # Ancillary files + + All required assets can be found [in the ComfyUI playbook repository](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/station-comfyui/). 
+ + - `assets/Dockerfile` — Builds the ComfyUI container image from NGC PyTorch base (ARM64) + - `assets/scripts/download-models.sh` — Downloads all model weights from Hugging Face using the **`hf`** CLI (`huggingface-hub` package) + - `assets/workflows/*.json` — Eight **UI** workflows (ComfyUI 0.4 graph with `nodes` / `links`) for **Load** in the web UI + - `assets/workflow_api/*.api.json` — The same eight graphs in **API** format for `/prompt` and automation (`curl`, scripts) + - `assets/scripts/api_to_ui_workflow.py` — Regenerates UI JSON from API JSON if you edit a graph programmatically + + # Time & risk + + * **Duration:** 45 minutes (excluding model downloads, which may take 30–60 minutes depending on network speed) + * **Risks:** + * Model downloads require HuggingFace authentication and substantial bandwidth (~150 GB total) + * Port 8188 must be accessible for the ComfyUI web interface + * **Rollback:** Stop and remove the Docker container. Delete the `models/` directory to reclaim disk space. + * **Last Updated:** 05/07/2026 + * Re-validated end-to-end on GB300: clean image build (`comfyui-gb300`, ~24 GB), container starts and serves on port 8188, all 8 mounted UI workflows enumerate correctly, `/object_info` returns 1092 node types, `/prompt` validation rejects on missing-model with clean errors. Documented benign startup warnings (`aimdo` CUDA-hook fallback, `urllib3` / `charset_normalizer` version skew) so users do not chase non-issues. + * 05/06/2026 — first publication; fixed walkthrough issues found on GB300: torchaudio shim for NGC PyTorch ABI mismatch, aarch64 onnxruntime swap, model-filename collisions (HiDream VAE → `ae_hidream.safetensors`, HunyuanVideo CLIP → `clip_l_hunyuan.safetensors`), `--gpus device=0` default, `df -h /` prereq, `~/.local/bin` PATH guidance, FLUX node list aligned with the actual graph, `.webp` output (not MP4), HF token via env not CLI, container output `chown` cleanup hint. 
+ + + + - + id: instructions + + label: Instructions + content: | + # Step 1. Verify your environment + + Confirm Docker, GPU access, and available disk space. + + ```bash + docker --version + nvidia-smi + df -h / + ``` + + - **Docker**: Must be running (version 24+ recommended). + - **nvidia-smi**: Should list the GB300 GPU with 252 GB HBM3e. + - **Disk space**: At least 200 GB free on `/` for model weights and the Docker image. On DGX Station `/home` is on the root filesystem, so checking `/` covers both. You can download fewer models by choosing a tier (see Step 4). + + If you haven't already, add your user to the docker group: + + ```bash + sudo usermod -aG docker $USER + newgrp docker + ``` + + # Step 2. Set up environment variables + + Set your HuggingFace token so the download script and container can access gated models. + + ```bash + # HuggingFace token (required). Run this in the SAME shell that will + # launch `bash assets/scripts/download-models.sh` in Step 4 — the script + # reads $HF_TOKEN from the environment and exits early if it is unset. + # Get a token from https://huggingface.co/settings/tokens + export HF_TOKEN="your_huggingface_token" + ``` + + Some models (FLUX.1, HiDream-I1) require accepting the model license on HuggingFace before downloading. Visit each model page and click "Agree and access" if prompted: + - [FLUX.1 dev](https://huggingface.co/black-forest-labs/FLUX.1-dev) + - [HiDream-I1 Full](https://huggingface.co/HiDream-ai/HiDream-I1-Full) + + # Step 3. Clone the playbook and build the container + + Clone the playbook repository and build the ComfyUI Docker image. The image is built on top of the NGC PyTorch container, which is already optimized for the GB300's ARM64 architecture. + + ```bash + git clone https://github.com/NVIDIA/dgx-spark-playbooks + cd dgx-spark-playbooks/nvidia/station-comfyui + ``` + + Build the container image: + + ```bash + docker build -t comfyui-gb300 -f assets/Dockerfile . 
+ ``` + + The build clones ComfyUI, installs dependencies (preserving the NGC-optimized PyTorch), and pre-installs custom nodes for video generation, ControlNet, and IP-Adapter. This takes approximately 5–10 minutes. + + # Step 4. Download models + + This playbook uses models organized into three tiers. Download only what you need, or download everything. + + | Tier | Models | Disk space | Peak VRAM (approx.) | Workflows enabled | + |------|--------|------------|---------------------|-------------------| + | **1 — Getting Started** | FLUX.1 dev, Wan 2.1 T2V 14B | ~70 GB | ~80 GB (Wan 720p clip) | Text-to-image, text-to-video | + | **2 — Intermediate** | + HiDream-I1, Wan 2.1 I2V, Cosmos-Predict2 | ~180 GB | ~100 GB (FLUX→Wan two-model graph) | + HiDream image gen, image-to-video, FLUX→Wan pipeline, Cosmos Video2World | + | **3 — Advanced** | + HunyuanVideo, FLUX ControlNet (Canny) | ~230 GB | ~120 GB (Hunyuan 1080p / long clips) | + 1080p video, ControlNet-guided generation | + + Peak VRAM depends on resolution, frame count, and precision; values above are **order-of-magnitude** for the default graphs in this playbook on a **GB300 (252 GB HBM3e)**. + + Install the Hugging Face Hub CLI (provides the **`hf`** command) if you do not already have it. The CLI installs to `~/.local/bin/`, which is **not on the default non-interactive PATH**, so add it before continuing: + + ```bash + pip3 install --break-system-packages huggingface-hub + echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc + export PATH="$HOME/.local/bin:$PATH" + hf --version # confirms PATH is correct + ``` + + Run the download script with your chosen tier (default downloads all): + + ```bash + # Download Tier 1 only (Getting Started): + bash assets/scripts/download-models.sh 1 + + # Download all tiers: + bash assets/scripts/download-models.sh + ``` + + Model downloads can take 30–60 minutes depending on network speed. 
The script uses the Hugging Face Hub **`hf download`** command (from the `huggingface_hub` package). If a download fails, the script **exits with an error** and prints which file was expected — check your token, network, and that you have accepted gated model licenses on Hugging Face. + + After Tier 1 completes, verify weights landed under `models/`: + + ```bash + ls -la ./models/diffusion_models/ + ls -la ./models/text_encoders/ | head + ``` + + # Step 5. Launch ComfyUI + + Start the ComfyUI container with all model and output directories mounted as volumes. On DGX Station, identify the GB300 GPU index with `nvidia-smi` and use `--gpus '"device=N"'` to target it. If the GB300 is your only GPU, `--gpus all` also works. + + ```bash + # Find the GB300 device index (look for "GB300" in the Name column) + nvidia-smi --query-gpu=index,name --format=csv,noheader + ``` + + The default `--gpus '"device=0"'` works on single-GPU stations where the GB300 is index 0. If `nvidia-smi` reports the GB300 at a different index (for example index 1 on dual-GPU stations with an RTX PRO 6000 + GB300), substitute that index in the command below. + + ```bash + # device=0 by default; replace with the GB300 index from the command above + docker run -d \ + --name comfyui \ + --gpus '"device=0"' \ + --ipc host \ + --ulimit memlock=-1 \ + -p 8188:8188 \ + -v "$(pwd)/models:/opt/ComfyUI/models" \ + -v "$(pwd)/output:/opt/ComfyUI/output" \ + -v "$(pwd)/input:/opt/ComfyUI/input" \ + -v "$(pwd)/assets/workflows:/opt/ComfyUI/user/default/workflows" \ + -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \ + comfyui-gb300 + ``` + + Check startup logs: + + ```bash + docker logs -f comfyui + ``` + + Expected output includes custom node loading messages and: + ``` + To see the GUI go to: http://0.0.0.0:8188 + ``` + + Press `Ctrl+C` to exit the log view. Open a web browser and navigate to `http://:8188` where `` is your DGX Station's IP address. 
+ + > [!NOTE] + > The startup logs include several **benign** warnings you can ignore: `aimdo: ... funchook_prepare(cuMemFree_v2) failed` (NGC PyTorch's CUDA hooks tool falling back to no-op), `urllib3 / charset_normalizer doesn't match a supported version`, `torchaudio missing` (covered by the import-only stub — no playbook workflow uses audio VAE), `DWPose: Onnxruntime not found ... switch to OpenCV with CPU device` (aarch64 has no `onnxruntime-gpu` wheel; CPU preprocessing still works), and `accelerate / GPTQModel / optimum / bitsandbytes not installed` from the HiDream sampler. The real "ready" signal is the `To see the GUI go to: ...` line above; treat anything else as suspect. + + ### UI workflows vs API graphs (important) + + ComfyUI uses **two different JSON shapes**: + + | Location | Format | Use | + |----------|--------|-----| + | `assets/workflows/*.json` mounted at `user/default/workflows/` | **UI workflow** (has `"nodes"` and `"links"`) | **Load** in the web UI, edit in the canvas, then **Queue Prompt** | + | `assets/workflow_api/*.api.json` (on the host repo, not mounted into the default workflow folder) | **API prompt graph** (flat node ids → `class_type` / `inputs`) | **`POST /prompt`**, `curl`, automation | + + If you open an **`.api.json`** file with **Load**, the UI shows **"Error: the workflow does not contain any nodes"** — that is expected; those files are not UI workflows. + + **Optional — run the same graph via HTTP API** (from the playbook root, with ComfyUI listening on port 8188). 
Strip any non-node keys (for example `_comment` in some API files), minify to one line, and POST: + + ```bash + PROMPT=$(python3 -c "import json; d=json.load(open('assets/workflow_api/flux-text-to-image.api.json')); print(json.dumps({k:v for k,v in d.items() if str(k).isdigit()}, separators=(',',':')))") + curl -sS http://127.0.0.1:8188/prompt \ + -X POST \ + -H "Content-Type: application/json" \ + -d "{\"prompt\":${PROMPT}}" | python3 -m json.tool + ``` + + The response includes a `prompt_id` you can correlate with server logs and the `output/` folder. + + **ComfyUI interface orientation:** + - **Canvas** — The central area where you build and view node workflows. + - **Queue Prompt** — The button (top right) that runs the current workflow. + - **Load** — Load a **UI** workflow from `flux-text-to-image.json`, `wan-text-to-video.json`, etc. (listed in the workflow sidebar under the mounted folder). + - **Manager** — Access ComfyUI-Manager to install additional custom nodes. + + # Step 6. Image generation with FLUX.1 dev + + *Requires: Tier 1 models* + + Load the pre-built FLUX text-to-image workflow. In ComfyUI, click **Load** and select **`flux-text-to-image.json`** (UI format). Do **not** use the `*.api.json` files in `assets/workflow_api/` with Load — they are for the HTTP API only. + + **What this workflow does:** + + The workflow connects these nodes in sequence: + + 1. **UNETLoader** — Loads the FLUX.1 dev 12B transformer (~24 GB in bf16) with `weight_dtype=default`. + 2. **DualCLIPLoader** — Loads CLIP-L and T5-XXL text encoders that convert your prompt into conditioning vectors. + 3. **CLIP Text Encode** — Takes your text prompt and produces positive conditioning. + 4. **FluxGuidance** — Applies FLUX's guidance value (default 3.5) to the conditioning. + 5. **EmptySD3LatentImage** — Creates a blank latent at your chosen resolution (default: 1024x1024). + 6. 
**ModelSamplingFlux** + **BasicScheduler** + **KSamplerSelect** + **BasicGuider** + **RandomNoise** — Configure FLUX's flow-matching schedule (20 steps, `euler`/`simple`). + 7. **SamplerCustomAdvanced** — The diffusion sampling loop that denoises the latent. + 8. **VAE Decode** — Converts the latent back into a pixel image. + 9. **Save Image** — Writes the result to the `output/` directory. + + **Try it:** + + 1. Find the **CLIP Text Encode** node and enter a prompt, for example: `A majestic snow leopard resting on a cliff at golden hour, photorealistic, 8k detail` + 2. Click **Queue Prompt**. + 3. The image generates in approximately 15–30 seconds. Results appear in the `output/` directory and in the preview node. + + Experiment with different prompts, resolutions (512x512 up to 2048x2048), and step counts. FLUX.1 dev produces high-quality results even at 20 steps. + + # Step 7. Video generation with Wan 2.1 + + *Requires: Tier 1 models* + + Load `wan-text-to-video.json` from the workflow browser. + + **What this workflow does:** + + 1. **Load Diffusion Model** — Loads the Wan 2.1 T2V 14B model (~28 GB in bf16). + 2. **CLIPLoader** — Loads the UMT5-XXL text encoder for Wan. + 3. **CLIP Text Encode** — Encodes your video description prompt. + 4. **EmptyHunyuanLatentVideo** — Creates a blank video latent (default: 720p, 81 frames at ~16 fps ≈ 5 seconds). Wan reuses this latent format. + 5. **KSampler** — Diffusion sampling over the video latent. This is the slowest step — expect 3–5 minutes for a 5-second clip on the GB300. + 6. **VAE Decode** — Converts latents to video frames. + 7. **SaveAnimatedWEBP** — Encodes frames into an animated WEBP file. + + **Try it:** + + 1. Enter a prompt: `A drone shot flying over a misty mountain forest at sunrise, cinematic` + 2. Click **Queue Prompt**. + 3. Generation takes 3–10 minutes at 720p with 81 frames. 
Monitor GPU memory with `nvidia-smi` in another terminal — the 14B model at 720p uses approximately 65–80 GB of the GB300's 252 GB HBM3e. + 4. The output **`.webp`** (animated WEBP from `SaveAnimatedWEBP`) appears in the `output/` directory. To convert to MP4, use `ffmpeg -i output/wan_t2v_output_00001_.webp output/wan_t2v_output.mp4`. + + **Tips:** + - Reduce frame count (e.g., 49 frames ≈ 3 seconds) for faster iteration. + - Wan 2.1 responds well to cinematic, descriptive prompts with camera movement descriptions. + + # Step 8. Intermediate workflows + + *Requires: Tier 2 models* + + This step introduces four additional workflows. Each builds on the basics from Steps 6–7. + + ## HiDream-I1 image generation + + Load `hidream-text-to-image.json`. + + HiDream-I1 Full is a **17B parameter** image model that uses **four text encoders** — CLIP-L, CLIP-G, T5-XXL, and Llama-3.1-8B-Instruct. The Llama encoder gives it exceptional prompt understanding, especially for complex or nuanced descriptions. + + The full pipeline uses approximately **60–65 GB** in bf16 — well within the GB300's capacity but impossible on most GPUs. + + **Try it:** Use a detailed, complex prompt to see the difference from FLUX — for example: `An astronaut riding a horse on Mars, with Earth visible in the sky, oil painting style by Rembrandt, dramatic chiaroscuro lighting` + + ## Wan 2.1 image-to-video + + Load `wan-image-to-video.json`. + + This workflow takes an **input image** and animates it into a video clip. Place your source image in the `input/` directory before running. + + 1. The **LoadImage** node reads from `input/`. + 2. The **Wan 2.1 I2V 14B** model generates motion that is consistent with the source image. + + **Try it:** Generate an image with FLUX first (Step 6), copy it from `output/` to `input/`, then animate it. + + ## FLUX → Wan combined pipeline + + Load `flux-to-wan-pipeline.json`. + + This workflow chains two models in a single graph: + 1. 
**FLUX.1 dev** generates a high-quality still image from your text prompt. + 2. The image is passed directly to **Wan 2.1 I2V 14B**, which animates it into a video. + + This avoids manually moving files between workflows. Both models load into GPU memory simultaneously (~95 GB total in bf16). + + ## Cosmos-Predict2 Video2World + + Load `cosmos-video2world.json`. + + **NVIDIA Cosmos-Predict2 14B** is NVIDIA's world foundation model for Video2World generation. It takes an input image and generates a physically plausible video extending from that scene. Place your source image in the `input/` directory before running. + + The Cosmos VAE is extremely efficient — it can encode/decode 1280x704 at 121 frames without tiling. + + **Try it:** Use an image from a previous FLUX generation as the start frame, with a prompt describing the motion: `A red ball rolling down a wooden ramp and bouncing off a wall, physics simulation, realistic lighting` + + # Step 9. Advanced workflows + + *Requires: Tier 3 models* + + ## HunyuanVideo 1080p generation + + Load `hunyuan-1080p-video.json`. + + This is the **true GB300 showcase**. HunyuanVideo's 13B model generating at 1080p resolution uses approximately **100–120 GB of VRAM** — impossible on any consumer or professional GPU, but well within the GB300's 252 GB. + + - Default: 1920x1056, 49 frames (~3 seconds). Note: height must be divisible by 16 for HunyuanVideo's latent space, so 1056 is used instead of 1080. + - Generation time: 2–5 minutes for 49 frames, longer for more. + - Monitor with `nvidia-smi` — you should see 100+ GB GPU memory usage. + + **Try it:** `A time-lapse of cherry blossoms falling in a Japanese garden with a koi pond, 4K cinematic` + + ## ControlNet with FLUX + + Load `flux-controlnet.json`. + + ControlNet lets you **guide image generation with structural conditioning** — edges, depth maps, or pose skeletons extracted from a reference image. + + 1. Place a reference image in `input/`. + 2. 
The **Canny Edge Preprocessor** extracts edge structure from the reference. + 3. The **FLUX.1 Canny Dev** model (a full FLUX variant fine-tuned for canny conditioning) generates an image following that structure while applying the text prompt's style and content. + 4. Both the preprocessed canny image and the final output are saved for comparison. + + **Use cases:** Architectural visualization, consistent character poses, style transfer while preserving composition. + + # Step 10. Cleanup + + Stop and remove the ComfyUI container: + + ```bash + docker stop comfyui + docker rm comfyui + ``` + + > [!NOTE] + > Files in `output/` and `models/` are written by the container as root, so removing them from the host shell needs `sudo` (e.g. `sudo rm -rf models/`). To avoid this in future runs, add `--user "$(id -u):$(id -g)"` to the `docker run` command in Step 5 — note that this requires the host UID to have write access to all mounted directories. + + Optionally remove the Docker image: + + ```bash + docker rmi comfyui-gb300 + ``` + + Optionally remove downloaded models to reclaim disk space: + + ```bash + rm -rf models/ + ``` + + Generated images and videos in `output/` are preserved on the host regardless of container state. 
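The optional HTTP API call in Step 5 filters the API-format graph down to its numeric node ids before posting it. Below is a minimal Python sketch of that filtering step, assuming the same `str(k).isdigit()` rule as the shell one-liner. The function name `build_prompt_payload` and the two-node graph are hypothetical and exist only for illustration; they are not part of the playbook assets.

```python
import json


def build_prompt_payload(api_graph: dict) -> str:
    """Return a minified JSON body suitable for POST /prompt.

    Keeps only entries whose key is a numeric node id, mirroring the
    `str(k).isdigit()` filter in the Step 5 one-liner, so helper keys
    such as `_comment` are dropped.
    """
    nodes = {k: v for k, v in api_graph.items() if str(k).isdigit()}
    # Compact separators produce a single-line body without extra spaces.
    return json.dumps({"prompt": nodes}, separators=(",", ":"))


# Hypothetical two-node graph for illustration only; a real graph would
# come from one of the files under assets/workflow_api/.
example_graph = {
    "_comment": "illustrative only",
    "1": {"class_type": "UNETLoader", "inputs": {}},
    "2": {"class_type": "SaveImage", "inputs": {"images": ["1", 0]}},
}

payload = build_prompt_payload(example_graph)
print(payload)
```

The returned string can be sent as the request body to `/prompt` with `Content-Type: application/json`, in the same way the `curl` command in Step 5 does.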
+ + + + - + id: troubleshooting + + label: Troubleshooting + content: | + # Common issues + + | Symptom | Cause | Fix | + |---------|-------|-----| + | "permission denied" when running docker | User not in docker group | Run `sudo usermod -aG docker $USER && newgrp docker` | + | Container fails to start with GPU error | NVIDIA Container Toolkit not configured | Run `nvidia-ctk runtime configure --runtime=docker` and restart Docker | + | ComfyUI web UI not accessible | Firewall blocking port or wrong IP | Verify with `docker logs comfyui`, check that port 8188 is open, use `http://:8188` | + | "Model file not found" when running workflow | Model not downloaded or wrong path | Verify models are in `./models/` and the volume mount is correct in the docker run command | + | HuggingFace download fails with 401 | Invalid or missing HF token | Verify `HF_TOKEN` is exported and valid at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) | + | CUDA out of memory during video generation | Frame count or resolution too high | Reduce frame count or resolution. At 720p with Wan 2.1 14B, keep clips under 5 seconds initially | + | CUDA out of memory during 1080p HunyuanVideo | Model + video tensors exceed GPU memory | Use fewer frames (e.g., 49 instead of 97). HunyuanVideo at 1080p needs ~100-120 GB | + | Workflow loads but nodes show red "missing" | Custom node not installed | Use ComfyUI-Manager (click Manager → Install Missing Custom Nodes) or rebuild the Docker image | + | Video output is a black screen | VAE decode issue or wrong model variant | Ensure you are using the correct model variant (T2V vs I2V) and the VAE is loaded | + | Very slow generation, GPU utilization low | PyTorch not using GPU or wrong CUDA version | Run `nvidia-smi` inside container: `docker exec comfyui nvidia-smi`. Ensure GPU is visible | + | "No module named ..." 
error on startup | Custom node dependency not installed | Exec into container and install: `docker exec comfyui pip install ` then restart | + | Docker build fails on ARM64 with `Could not find a version that satisfies the requirement onnxruntime-gpu` | `onnxruntime-gpu` has no aarch64 wheel on PyPI | Already handled by the shipped Dockerfile, which `sed`-substitutes `onnxruntime-gpu` → `onnxruntime` (CPU build) in every custom_node `requirements.txt` before `pip install`. If you see this error, you are building from a Dockerfile predating that fix — pull the latest assets and rebuild. | + | Docker build fails on ARM64 (other packages) | Some custom-node dependencies have no aarch64 wheel | Find the failing package in the build log. The custom-node install loop is wrapped in `\|\| true`, so the build still completes but the affected node will be missing modules at runtime. Either skip the node (remove its directory from `custom_nodes/` in the Dockerfile clone block) or install via ComfyUI-Manager after launch with a manually built wheel. | + | NGC image pull requires authentication | NGC registry needs login | Run `docker login nvcr.io` with your NGC API key | + | `device >= 0 && device < num_gpus INTERNAL ASSERT FAILED` on startup | Using `--gpus all` on a multi-GPU system causes a PyTorch assertion | Use `--gpus '"device=N"'` to target the GB300 specifically (check index with `nvidia-smi`) | + | `No HiDream models available` warning on startup | HiDream custom node reports no models found | This is a warning, not an error. It clears once HiDream model files are downloaded (Tier 2) | + | Web UI: **"Error: the workflow does not contain any nodes"** when using **Load** | The file is **API** format (flat `node_id → {class_type, inputs}`), not a UI workflow | In the playbook, use **`assets/workflows/.json`** in the Load dialog (under **user/default/workflows** inside the container). 
For **`curl`** / HTTP API, use **`assets/workflow_api/.api.json`** inside `{"prompt": ...}`. | + | `huggingface-cli: command not found` or download script errors | Deprecated CLI name | Install `huggingface_hub` and use **`hf download`** (the script does this automatically). | + | Download script exits but `models/diffusion_models/` is empty | Silent failure in older scripts or wrong token | Re-run with `bash -x assets/scripts/download-models.sh 1`; confirm `HF_TOKEN` and license acceptance on Hugging Face. The script now **fails fast** if a file is missing after `hf download`. | + | Container exits on startup with **`ModuleNotFoundError: torchaudio`** | Container was built from a Dockerfile predating the torchaudio shim | Rebuild the image: `docker build -t comfyui-gb300 -f assets/Dockerfile .`. The shipped Dockerfile creates an import-only `torchaudio` stub (NGC PyTorch's custom NVFP4 ABI is incompatible with PyPI torchaudio wheels). Lightricks audio VAE workflows are not supported in this image; no other workflow needs torchaudio. | + | `OSError: ... undefined symbol: torch_dtype_float4_e2m1fn_x2` from torchaudio | Real torchaudio installed on top of NGC PyTorch | Same fix as above — rebuild from the shipped Dockerfile. Do **not** `pip install torchaudio` manually inside the container. | + | `DWPose: Onnxruntime not found or doesn't come with acceleration providers, switch to OpenCV with CPU device` | Expected on aarch64. PyPI has no `onnxruntime-gpu` wheel for arm64; the Dockerfile substitutes the CPU `onnxruntime` package | Informational warning, not an error. DWPose preprocessing runs on CPU (slower than GPU) but produces correct output. | + | `aimdo: ... funchook_prepare(cuMemFree_v2) failed: 8 Failed to allocate memory in unused regions` at startup | NGC PyTorch's CUDA-hooks diagnostic tool (`aimdo`) cannot install hooks under default container caps and falls back to no-op | Benign. 
ComfyUI works normally; the message is informational from the NGC base image. No action required. | + | `RequestsDependencyWarning: urllib3 (...) or charset_normalizer (...) doesn't match a supported version!` at startup | Version skew between `requests` and the NGC-pinned `urllib3` / `charset_normalizer` wheels | Benign. ComfyUI's HTTP traffic still works. Suppress with `PYTHONWARNINGS=ignore::requests.RequestsDependencyWarning` if it bothers you. | + + > [!NOTE] + > ComfyUI logs are visible with `docker logs -f comfyui`. Most errors (missing models, node failures) are reported in these logs with clear messages. + + + + + resources: + - name: ComfyUI (GitHub) + url: https://github.com/comfyanonymous/ComfyUI + + + - name: ComfyUI Examples + url: https://comfyanonymous.github.io/ComfyUI_examples/ + + + - name: FLUX.1 on HuggingFace + url: https://huggingface.co/black-forest-labs/FLUX.1-dev + + + - name: Wan 2.1 on HuggingFace + url: https://huggingface.co/Wan-AI/Wan2.1-T2V-14B + + + - name: NVIDIA Cosmos-Predict2 + url: https://huggingface.co/nvidia/Cosmos-Predict2-14B-Text2Video + + diff --git a/nvidia/station-comfyui/endpoint-test.yaml b/nvidia/station-comfyui/endpoint-test.yaml index f4d0cb4..2af3948 100644 --- a/nvidia/station-comfyui/endpoint-test.yaml +++ b/nvidia/station-comfyui/endpoint-test.yaml @@ -82,7 +82,7 @@ spec: # Ancillary files - All required assets can be found [in the ComfyUI playbook repository](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/station-comfyui/). + All required assets can be found in the [ComfyUI playbook repository](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/station-comfyui/). 
- `assets/Dockerfile` — Builds the ComfyUI container image from NGC PyTorch base (ARM64) - `assets/scripts/download-models.sh` — Downloads all model weights from Hugging Face using the **`hf`** CLI (`huggingface-hub` package) @@ -97,9 +97,8 @@ spec: * Model downloads require HuggingFace authentication and substantial bandwidth (~150 GB total) * Port 8188 must be accessible for the ComfyUI web interface * **Rollback:** Stop and remove the Docker container. Delete the `models/` directory to reclaim disk space. - * **Last Updated:** 05/07/2026 - * Re-validated end-to-end on GB300: clean image build (`comfyui-gb300`, ~24 GB), container starts and serves on port 8188, all 8 mounted UI workflows enumerate correctly, `/object_info` returns 1092 node types, `/prompt` validation rejects on missing-model with clean errors. Documented benign startup warnings (`aimdo` CUDA-hook fallback, `urllib3` / `charset_normalizer` version skew) so users do not chase non-issues. - * 05/06/2026 — first publication; fixed walkthrough issues found on GB300: torchaudio shim for NGC PyTorch ABI mismatch, aarch64 onnxruntime swap, model-filename collisions (HiDream VAE → `ae_hidream.safetensors`, HunyuanVideo CLIP → `clip_l_hunyuan.safetensors`), `--gpus device=0` default, `df -h /` prereq, `~/.local/bin` PATH guidance, FLUX node list aligned with the actual graph, `.webp` output (not MP4), HF token via env not CLI, container output `chown` cleanup hint. + * **Last Updated:** 05/26/2026 + * First Publication @@ -368,7 +367,7 @@ spec: ## Cosmos-Predict2 Video2World - Load `cosmos-text-to-video.json`. + Load `cosmos-video2world.json`. **NVIDIA Cosmos-Predict2 14B** is NVIDIA's world foundation model for Video2World generation. It takes an input image and generates a physically plausible video extending from that scene. Place your source image in the `input/` directory before running. 
diff --git a/nvidia/station-gr00t/README.md b/nvidia/station-gr00t/README.md index 65fcb4c..f1f6474 100644 --- a/nvidia/station-gr00t/README.md +++ b/nvidia/station-gr00t/README.md @@ -7,16 +7,28 @@ - [Overview](#overview) - [Instructions](#instructions) + - [1a. Git LFS (required for a clean clone)](#1a-git-lfs-required-for-a-clean-clone) + - [1b. Clone and check out `n1.6-release`](#1b-clone-and-check-out-n16-release) + - [1c. Install Python dependencies](#1c-install-python-dependencies) - [Troubleshooting](#troubleshooting) + - [Issue: `git clone` fails or demo videos are tiny / missing (Git LFS)](#issue-git-clone-fails-or-demo-videos-are-tiny-missing-git-lfs) + - [Issue: `GR1`, `demo_data/gr1.PickNPlace`, or scripts do not match the playbook](#issue-gr1-demodatagr1picknplace-or-scripts-do-not-match-the-playbook) + - [Issue: `install_deps.sh` is not allowed on your machine (policy) or you need to know what it changes](#issue-installdepssh-is-not-allowed-on-your-machine-policy-or-you-need-to-know-what-it-changes) + - [Issue: `uv sync` (Option B) appears stuck for hours building `flash-attn` on aarch64](#issue-uv-sync-option-b-appears-stuck-for-hours-building-flash-attn-on-aarch64) - [Issue: `install_deps.sh` fails building torchcodec](#issue-installdepssh-fails-building-torchcodec) - [Issue: `huggingface-cli download` fails with 401 Unauthorized](#issue-huggingface-cli-download-fails-with-401-unauthorized) + - [Issue: `huggingface-cli download` fails with `Permission denied: '/home/.../.cache/huggingface/hub/...'`](#issue-huggingface-cli-download-fails-with-permission-denied-homecachehuggingfacehub) + - [Issue: `huggingface-cli download` returns `500 Internal Server Error` from the `xet-read-token` endpoint](#issue-huggingface-cli-download-returns-500-internal-server-error-from-the-xet-read-token-endpoint) + - [Issue: `externally-managed-environment` or `pip` installs not going into 
`.venv`](#issue-externally-managed-environment-or-pip-installs-not-going-into-venv) - [Issue: CUDA out of memory during fine-tuning](#issue-cuda-out-of-memory-during-fine-tuning) - - [Issue: Triton/PTXAS errors about `sm_103a` during inference](#issue-tritonptxas-errors-about-sm103a-during-inference) + - [Issue: Triton / PTXAS errors about `sm_103a` (GB300 / Blackwell)](#issue-triton-ptxas-errors-about-sm103a-gb300-blackwell) - [Issue: `ModuleNotFoundError: No module named 'gr00t'`](#issue-modulenotfounderror-no-module-named-gr00t) + - [Issue: `NotImplementedError` in `get_frames_by_indices` when backend is `pyav`](#issue-notimplementederror-in-getframesbyindices-when-backend-is-pyav) + - [Issue: Training “hangs” — low GPU utilization, no traceback, very slow steps](#issue-training-hangs-low-gpu-utilization-no-traceback-very-slow-steps) + - [Issue: Video decoding errors / `torchcodec` not found (general)](#issue-video-decoding-errors-torchcodec-not-found-general) - [Issue: Training loss is not decreasing](#issue-training-loss-is-not-decreasing) - [Issue: `nvidia-smi` shows the wrong GPU](#issue-nvidia-smi-shows-the-wrong-gpu) - - [Issue: Slow data loading during training](#issue-slow-data-loading-during-training) - - [Issue: Video decoding errors (`NotImplementedError` or torchcodec not found)](#issue-video-decoding-errors-notimplementederror-or-torchcodec-not-found) + - [Issue: OpenCV or decord cannot decode LIBERO AV1](#issue-opencv-or-decord-cannot-decode-libero-av1) --- @@ -24,57 +36,132 @@ ## Basic idea -NVIDIA Isaac GR00T N1.6 is a 3-billion-parameter open vision-language-action (VLA) foundation model for generalist humanoid robot skills. It combines a Cosmos-Reason-2B vision-language backbone with a 32-layer Diffusion Transformer (DiT) action head that denoises continuous robot actions from multimodal input — language instructions and camera images. 
The model is pre-trained on over 10,000 hours of robot demonstration data spanning bimanual arms, semi-humanoid platforms, and full humanoids, then adapted to specific embodiments and tasks through fine-tuning. +NVIDIA Isaac GR00T N1.6 is a 3-billion-parameter open vision-language-action (VLA) foundation model for generalist humanoid robot skills. It combines a Cosmos-family vision-language backbone with a 32-layer Diffusion Transformer (DiT) action head that denoises continuous robot actions from multimodal input — language instructions and camera images. The model is pre-trained on a large mixture of robot demonstration data, then adapted to specific embodiments and tasks through fine-tuning. -In this playbook you will fine-tune GR00T N1.6 on the LIBERO Spatial benchmark — a manipulation task suite that tests spatial reasoning with a Panda robot arm. DGX Station's GB300 GPU with 284 GB of HBM3e memory enables a per-device batch size of **128**, far exceeding the typical 32–64 used on smaller GPUs, which accelerates convergence and improves training throughput. +High-level architecture (VLM + DiT action head), as in the upstream Isaac GR00T repo: + +![GR00T N1.6 reference architecture](./assets/GR00T-reference-arch-diagram.png) + +*Source: [NVIDIA Isaac GR00T — `media/GR00T-reference-arch-diagram.png`](https://github.com/NVIDIA/Isaac-GR00T/blob/n1.6-release/media/GR00T-reference-arch-diagram.png). If the local image above is missing, the upstream copy is at `https://raw.githubusercontent.com/NVIDIA/Isaac-GR00T/n1.6-release/media/GR00T-reference-arch-diagram.png`.* + +In this playbook you will fine-tune GR00T N1.6 on the **LIBERO Spatial** benchmark on a **DGX Station** with **GB300** (large unified memory). That setup supports a high **global batch size (128)** on a single GPU, which improves training throughput compared to typical 24–80 GB consumer or datacenter GPUs. 
+ +## LIBERO Spatial (what you are fine-tuning on) + +**LIBERO Spatial** is part of the [LIBERO](https://libero-project.github.io/main.html) suite of simulated tabletop manipulation benchmarks. The **spatial** split emphasizes **where** objects need to be placed: tasks such as putting a bowl on a **stove burner** vs a **plate**, placing utensils in a **mug** vs next to it, or moving objects to **left/right/front** targets on the table. Episodes include third-person RGB video, proprioceptive state, language instructions, and continuous end-effector actions in a consistent LeRobot v2 layout. Understanding these constraints helps when you read training logs or open-loop evaluation plots. + +## What kind of fine-tuning this playbook uses + +This playbook runs the **default Isaac GR00T fine-tuning recipe** from `launch_finetune.py`: **not** full-model weight updates of the entire 3B VLM. In the stock configuration, training focuses on the **action head (DiT)** and **projector / adapter paths** that map observations into the action model, with strong **state dropout** and **color jitter** so the policy leans on vision. Optional flags such as `--tune-llm` or `--tune-visual` (mentioned under Next steps) trade compute and memory for updating more of the backbone. **LoRA** is not the default here; if your team uses LoRA or other PEFT variants, treat that as a separate configuration branch from this playbook. + +## NVIDIA DGX Station (why this hardware) + +**DGX Station** is a deskside AI system built for **large-memory GPU** training and inference (this playbook targets **GB300** with **284 GB HBM3e**). Beyond robotics, the same class of machine supports **large-model fine-tuning**, **RAG serving**, **multi-modal training**, and **CUDA research** where single-GPU memory and bandwidth dominate. 
For GR00T, the headline benefit is fitting **much larger batch sizes** per GPU than on smaller cards, which stabilizes gradients and improves **samples per second** when the data pipeline keeps up. ## What you'll accomplish -- Set up the Isaac GR00T environment using `uv` (fast, reproducible Python packaging) -- Verify the pre-trained base model loads and runs inference -- Fine-tune GR00T N1.6 on the LIBERO Spatial dataset with batch size 128 -- Evaluate the fine-tuned model using open-loop evaluation and measure inference latency +- Check out the **`n1.6-release`** branch of Isaac GR00T so commands, embodiment tags, and `demo_data/` match GR00T **N1.6** +- Set up the environment with `uv` (project-local `.venv`) and understand what the optional `install_deps.sh` script changes on the system +- Apply the recommended **PyAV `get_frames_by_indices` patch** when `torchcodec` is unavailable so LIBERO **AV1** video decoding does not stall on an **ffmpeg** subprocess fallback +- Verify the base model, fine-tune on LIBERO Spatial at batch size **128**, run open-loop evaluation, and measure inference latency (with **GB300 / Blackwell** TorchDynamo compilation notes) ## What to know before starting -- Familiarity with Python virtual environments -- Familiarity with PyTorch training workflows (epochs, batch size, loss curves) -- General understanding of robot manipulation concepts (actions, observations, trajectories) +- Familiarity with Python virtual environments (`source .venv/bin/activate`) +- Familiarity with PyTorch training concepts (batch size, loss, checkpoints) +- Basic robot manipulation vocabulary (trajectories, observations, actions) +- Comfort running commands that may use **`sudo`** for system packages (or use the documented user-space alternative) ## Prerequisites -- NVIDIA DGX Station with GB300 GPU (Blackwell SM103, 284 GB HBM3e) -- CUDA toolkit installed: `nvcc --version` should show CUDA 12.8+ -- Git installed: `git --version` -- HuggingFace account with 
access token (for model and dataset downloads) -- Network access to HuggingFace, GitHub, and PyPI -- At least 30 GB of free disk space (venv + model + dataset) +- NVIDIA **DGX Station** with **GB300** (Blackwell SM103, 284 GB HBM3e) +- CUDA toolkit usable by PyTorch: `nvcc --version` should show **CUDA 12.8+** (often already under `/usr/local/cuda` on DGX images) +- **Git** and **Git LFS** (`git lfs version`) — LFS is required for some demo assets and submodules; install with `sudo apt-get install -y git-lfs` then `git lfs install` if missing +- Hugging Face account and **HF_TOKEN** for model and dataset downloads +- Network access to Hugging Face, GitHub, and PyPI +- At least **~30 GB** free disk for `.venv`, checkpoints, and the LIBERO download ## Time & risk -* **Duration:** ~45 minutes (5 min setup, 5 min dataset download, 25 min fine-tuning at 2000 steps, 5 min evaluation) -* **Risks:** Model download requires HuggingFace authentication; `uv sync` installs packages into a project-local `.venv` -* **Rollback:** Delete the cloned `Isaac-GR00T` directory to restore state. No system-level changes are made. -* **Last Updated:** 04/06/2026 +* **Duration:** ~45 minutes end-to-end when the video backend is healthy (setup, downloads, ~20–25 min training at 2000 steps, eval and inference) +* **Risks:** `scripts/deployment/dgpu/install_deps.sh` performs **system-level** `apt` operations and may install the **CUDA 12.8 toolkit** if `/usr/local/cuda` is absent (see Instructions). Model download requires Hugging Face authentication. +* **Rollback:** Remove the cloned `Isaac-GR00T` directory and optionally `rm -rf ~/.local/share/uv` if you want to reclaim `uv` caches. Reverting `apt`-installed packages is a separate admin task; the playbook does not uninstall them automatically. +* **Last Updated:** 05/26/2026 * First Publication ## Instructions ## Step 1. Clone Isaac GR00T and install dependencies -Clone the repository and run the dGPU install script. 
This uses `uv` for fast, reproducible dependency management and automatically detects the aarch64 architecture: +### 1a. Git LFS (required for a clean clone) + +If `git clone` fails with errors about **Git LFS** or missing pointer files, install and initialize LFS, then remove any partial `Isaac-GR00T` directory and clone again: + +```bash +sudo apt-get update +sudo apt-get install -y git-lfs +git lfs install +``` + +### 1b. Clone and check out `n1.6-release` + +The **`main`** branch tracks ongoing development (for example newer GR00T milestones) and **does not** always match this **N1.6** playbook. Embodiment tags such as **`GR1`**, paths like **`demo_data/gr1.PickNPlace`**, and tutorial scripts are aligned with the **`n1.6-release`** branch. ```bash git clone --recurse-submodules https://github.com/NVIDIA/Isaac-GR00T cd Isaac-GR00T +git fetch origin +git checkout n1.6-release +git submodule update --init --recursive +``` + +### 1c. Install Python dependencies + +#### Option A — `install_deps.sh` (matches upstream docs; uses `sudo`) + +This script is the supported path. 
It may make **system-level** changes: + +- Runs `apt-get update` and installs **`ffmpeg`** and **`libaio-dev`** +- If **`/usr/local/cuda`** is missing, adds the NVIDIA CUDA apt repository and installs **`cuda-toolkit-12-8`** +- Installs **`uv`** into your user account if needed, then runs **`uv sync`** and **`uv pip install -e .`** into the project **`.venv`** +- On **aarch64** only: installs FFmpeg **development** packages and **builds `torchcodec` from source** into `.venv` + +```bash I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 bash scripts/deployment/dgpu/install_deps.sh ``` -The install script: -- Installs system dependencies (`ffmpeg`, `libaio-dev`) -- Installs `uv` if not present -- Runs `uv sync` to create a `.venv` with all Python dependencies (PyTorch 2.7.1 + CUDA 12.8) -- Builds `torchcodec` from source on aarch64 (required for video decoding) +#### Option B — User-space only (no `install_deps.sh`) + +Use this only when **CUDA 12.8+** is already installed, system **`ffmpeg`** / **`libaio-dev`** are already present, and your policy forbids the script's `apt` or CUDA steps. From the **Isaac-GR00T** repo root, install **`uv`** if needed, then: + +```bash +command -v uv >/dev/null || curl -LsSf https://astral.sh/uv/install.sh | sh +export PATH="/usr/local/cuda/bin:$HOME/.local/bin:$PATH" +export CUDA_HOME=/usr/local/cuda +uv sync +uv pip install -e . +``` + +You still need a working **video backend** for LIBERO (see Step 2). On aarch64, building **torchcodec** inside `.venv` without the script is possible but manual; see Troubleshooting. + +> [!IMPORTANT] +> **`PATH` and `CUDA_HOME` matter on multi-toolkit hosts.** If the system has both an old Ubuntu `nvidia-cuda-toolkit` package (`/usr/bin/nvcc` ≈ 12.0) and a current NVIDIA CUDA repo install (`/usr/local/cuda-13.x/bin/nvcc`), `uv` will pick whichever appears first on `PATH`. 
Putting `/usr/local/cuda/bin` first (and exporting `CUDA_HOME`) is required for `flash-attn`'s source build to find the matching toolkit. Verify with `nvcc --version` after the export. + +> [!WARNING] +> **`flash-attn` build on aarch64 takes ~2 hours from source.** The upstream `pyproject.toml` only lists pre-built `flash-attn==2.7.4.post1` wheels for **`x86_64`**; on aarch64 (Grace + GB300), `uv sync` falls back to compiling ~72 CUDA kernels from source. A faster route is to pin `flash-attn==2.8.1` and reuse the GitHub release's prebuilt aarch64 wheel: +> +> ```toml +> # In pyproject.toml under [project] dependencies: +> "flash-attn==2.8.1", +> +> # In [tool.uv.sources]: +> flash-attn = [ +> { url = "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.1/flash_attn-2.8.1+cu12torch2.10cxx11abiTRUE-cp312-cp312-linux_aarch64.whl", +> marker = "sys_platform == 'linux' and platform_machine == 'aarch64' and python_version == '3.12'" }, +> ] +> ``` +> +> With this pin, `uv sync` finishes in ~1 minute on aarch64 instead of ~2 hours. The wheel works against torch 2.10. Verified on GB300 + CUDA 13.1 in this playbook's validation run. Activate the virtual environment: @@ -85,15 +172,32 @@ source .venv/bin/activate Verify GPU access: ```bash -CUDA_VISIBLE_DEVICES=1 python -c "import torch; print(torch.cuda.get_device_name(0))" +CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_name(0))" ``` Expected output: `NVIDIA GB300` > [!NOTE] -> Replace `CUDA_VISIBLE_DEVICES=1` with the index of your GB300 GPU throughout this playbook. Run `nvidia-smi --query-gpu=index,name --format=csv,noheader` to find it. +> Examples in this playbook use **`CUDA_VISIBLE_DEVICES=0`** because the GB300 is at index `0` on a single-GPU Station. 
On a multi-GPU Station (for example RTX PRO 6000 + GB300), the GB300 may be at a different index — run `nvidia-smi --query-gpu=index,name --format=csv,noheader`, find the GB300 row, and substitute that index everywhere `CUDA_VISIBLE_DEVICES=0` appears below. -## Step 2. Set up HuggingFace authentication +## Step 2. PyAV patch for LIBERO video (strongly recommended) + +On many stacks **`torchcodec`** fails to import or build, the resolver falls back to **`pyav`**, and stock **`n1.6-release`** can raise **`NotImplementedError`** from `get_frames_by_indices` for the **`pyav`** backend (fallback order is already `torchcodec` → `decord` → `pyav` → `ffmpeg`). Without this patch, training may **appear hung**: GPU idle, no traceback, while **ffmpeg** spawns per-frame decode work on the CPU. + +From the **Isaac-GR00T repo root** with **`n1.6-release`** checked out and **`.venv` activated**: + +```bash +git apply /path/to/dgx-station-playbooks/nvidia/station-gr00t/assets/patches/001-pyav-get-frames-by-indices.patch +uv pip install av +``` + +If you copied `nvidia/station-gr00t/assets/patches/` into the Isaac-GR00T root instead, use `git apply assets/patches/001-pyav-get-frames-by-indices.patch`. + +Details and re-apply rules: `nvidia/station-gr00t/assets/patches/README.md`. + +After patching, repeated log lines such as `Video backend 'torchcodec' is not available, falling back to 'pyav'` are **expected** and noisy but not fatal. + +## Step 3. Set up HuggingFace authentication ```bash export HF_TOKEN="your_huggingface_token" @@ -101,7 +205,7 @@ export HF_TOKEN="your_huggingface_token" Get a token from https://huggingface.co/settings/tokens if you don't have one. -## Step 3. Download the dataset and model +## Step 4. 
Download the dataset and model Download the LIBERO Spatial dataset and the GR00T N1.6 base model: @@ -119,18 +223,33 @@ cp examples/LIBERO/modality.json \ huggingface-cli download nvidia/GR00T-N1.6-3B ``` +> [!NOTE] +> **HF cache permission errors:** If `huggingface-cli download` fails with `Permission denied: '/home/.../.cache/huggingface/hub/...'`, the cache directory was previously created by a Docker container running as root (common on shared dev boxes). Point HF at a user-owned cache for this run: +> +> ```bash +> export HF_HOME=$HOME/hf_cache_gr00t +> ``` +> +> **Transient `xet-read-token` 500 errors:** Hugging Face's xet backend occasionally returns `500 Internal Server Error` for dataset downloads. Disable it: +> +> ```bash +> export HF_HUB_DISABLE_XET=1 +> ``` + Verify the dataset is ready: ```bash ls examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/modality.json ``` -## Step 4. Verify the base model loads and runs +**Expected result:** the command prints the full path to **`modality.json`** (and `ls` exits 0). That confirms the merged modality file exists next to the downloaded LeRobot dataset metadata. -Confirm the GR00T N1.6 base model loads correctly and can produce actions. The base model ships with a GR1 demo dataset for quick verification: +## Step 5. Verify the base model loads and runs + +Confirm the GR00T N1.6 base model loads and produces actions using the **GR1** demo shipped on **`n1.6-release`**: ```bash -CUDA_VISIBLE_DEVICES=1 python scripts/deployment/standalone_inference_script.py \ +TORCHDYNAMO_DISABLE=1 CUDA_VISIBLE_DEVICES=0 python scripts/deployment/standalone_inference_script.py \ --model-path nvidia/GR00T-N1.6-3B \ --dataset-path demo_data/gr1.PickNPlace \ --embodiment-tag GR1 \ @@ -140,20 +259,19 @@ CUDA_VISIBLE_DEVICES=1 python scripts/deployment/standalone_inference_script.py --steps 32 ``` -You should see per-step timing output and no errors. 
This confirms the model, CUDA, and data pipeline all work before committing to a longer fine-tuning run. +**`TORCHDYNAMO_DISABLE=1`** avoids **`torch.compile`** / Triton paths that can fail on GB300 with **`ptxas-blackwell fatal: Value 'sm_103a' is not defined for option 'gpu-name'`**. Keep it on all **`standalone_inference_script.py`** invocations in this playbook unless you have a Triton build that supports SM103. + +You should see per-step timing output and no errors. This confirms the model, CUDA, and data pipeline work before a long fine-tuning run. > [!NOTE] -> If you see Triton/PTXAS errors about `sm_103a`, prepend `TORCHDYNAMO_DISABLE=1` to the command. See Troubleshooting for details. +> The base model's pretrained processor does not include the **`LIBERO_PANDA`** embodiment configuration, so you cannot run this standalone script on the LIBERO dataset with the **base** checkpoint alone. The LIBERO modality config is registered during fine-tuning. That is expected — LIBERO is a post-training benchmark. -> [!NOTE] -> The base model's pretrained processor does not include the `LIBERO_PANDA` embodiment configuration, so you cannot run evaluation directly on the LIBERO dataset with the base model. The LIBERO modality config is registered during fine-tuning. This is expected — LIBERO is a post-training benchmark. +## Step 6. Fine-tune GR00T N1.6 on LIBERO Spatial -## Step 5. Fine-tune GR00T N1.6 on LIBERO Spatial - -Fine-tune the base model on the LIBERO Spatial dataset. DGX Station's GB300 GPU with 284 GB HBM3e lets you use a batch size of **128** — roughly 4x what fits on a typical 80 GB GPU. Larger batch sizes mean more stable gradients and faster convergence per wall-clock hour. +Fine-tune the base model on LIBERO Spatial. DGX Station's GB300 GPU with 284 GB HBM3e allows a global batch size of **128** — roughly several times what fits on a typical 80 GB GPU. 
Larger batches stabilize gradients and improve wall-clock throughput **when the dataloader keeps the GPU fed**. ```bash -CUDA_VISIBLE_DEVICES=1 python \ +CUDA_VISIBLE_DEVICES=0 python \ gr00t/experiment/launch_finetune.py \ --base-model-path nvidia/GR00T-N1.6-3B \ --dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \ @@ -172,29 +290,34 @@ CUDA_VISIBLE_DEVICES=1 python \ --dataloader-num-workers 4 ``` -Training runs for **2000 steps** at batch size 128 and takes approximately 20–25 minutes on the GB300. +If GPU utilization stays **near zero** for many minutes while the process is alive, suspect **video decoding** (see Step 2 patch and Troubleshooting). You can try **`--dataloader-num-workers 8`** if CPU cores are available. + +Training runs for **2000 steps** at batch size 128 and takes approximately **20–25 minutes** on GB300 when **`torchcodec`** is the active video backend. + +> [!IMPORTANT] +> **With the PyAV fallback (Step 2 patch + no torchcodec)**, expect ~5–6 s per step instead of <1 s — so 2000 steps is closer to **2.5–3 hours**, and GPU utilization sits in the 3–30 % range while CPU-side video decoding starves the GPU. To validate the workflow without the long wait, lower `--max-steps` (e.g. `100`) and `--save-steps` (e.g. `50`); loss should still drop visibly (validated drop **1.07 → 0.63** in 100 steps in this playbook's GB300 run). If you need full-throughput training, build `torchcodec` from source (Troubleshooting → "Video decoding errors") or run **Option A** which builds it for you. > [!NOTE] -> This playbook uses 2000 steps to keep execution time under an hour. For production-quality results matching the published 97.65% success rate on LIBERO Spatial, increase to **20,000 steps** by changing `--max-steps 20000`. Published results used batch size 640 across 8 GPUs (80 per GPU) — batch size 128 on a single GB300 exceeds the per-GPU batch size used in the published benchmarks. 
+> This playbook uses 2000 steps to keep execution time under an hour. For production-quality results closer to the published **97.65%** success rate on LIBERO Spatial, increase to **20,000 steps** (`--max-steps 20000`). Published settings used batch size **640** across **8** GPUs — 128 on one GB300 exceeds the per-GPU batch in that reference. **What the training flags mean:** | Flag | Value | Purpose | |------|-------|---------| -| `--global-batch-size` | 128 | Total samples per training step. GB300's 284 GB HBM3e makes this possible on a single GPU. | -| `--state-dropout-prob` | 0.8 | Drops state input 80% of the time during training, forcing the model to rely on vision. Improves generalization. | -| `--color-jitter-params` | brightness/contrast/saturation/hue | Randomly perturbs image colors during training for robustness to lighting variation. | -| `--warmup-ratio` | 0.05 | Linearly ramps learning rate from 0 to 1e-4 over the first 5% of steps (100 steps). | -| `--save-steps` | 500 | Saves a checkpoint every 500 steps. | +| `--global-batch-size` | 128 | Total samples per training step; enabled by GB300 memory. | +| `--state-dropout-prob` | 0.8 | Drops proprioceptive state 80% of the time so the model relies on vision. | +| `--color-jitter-params` | brightness/contrast/saturation/hue | Photometric augmentation for lighting robustness. | +| `--warmup-ratio` | 0.05 | Linear LR warmup over the first 5% of steps. | +| `--save-steps` | 500 | Checkpoint cadence under `output/libero_spatial_ft/`. | -Monitor the training loss in the terminal. The HuggingFace Trainer logs progress at each step — look for the `loss` field decreasing over time. Checkpoints are saved every 500 steps to `output/libero_spatial_ft/`. +Monitor the Hugging Face **Trainer** `loss` in the terminal. Checkpoints land under `output/libero_spatial_ft/`. -## Step 6. Evaluate the fine-tuned model +## Step 7. Evaluate the fine-tuned model -Run open-loop evaluation on the fine-tuned checkpoint. 
This compares the model's predicted actions against the ground truth from the dataset: +Open-loop evaluation compares predicted actions to dataset ground truth and writes plots to **`/tmp/open_loop_eval/`**: ```bash -CUDA_VISIBLE_DEVICES=1 python gr00t/eval/open_loop_eval.py \ +CUDA_VISIBLE_DEVICES=0 python gr00t/eval/open_loop_eval.py \ --dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \ --embodiment-tag LIBERO_PANDA \ --model-path output/libero_spatial_ft/checkpoint-2000/ \ @@ -202,27 +325,17 @@ CUDA_VISIBLE_DEVICES=1 python gr00t/eval/open_loop_eval.py \ --action-horizon 16 ``` -The evaluation outputs: - -- **Per-trajectory MSE and MAE** printed to the terminal -- **Average MSE** across all evaluated trajectories -- **JPEG visualizations** saved to `/tmp/open_loop_eval/` showing ground truth vs. predicted actions for each action dimension (x, y, z, roll, pitch, yaw, gripper) - -Key things to look for in the plots: - -- **Predicted trajectories** (orange line) should closely track the **ground truth** (blue line) -- **Gripper timing** — opening and closing at the correct moments -- **Lower MSE** indicates better action prediction accuracy +**How to read the run:** the terminal prints **per-trajectory MSE/MAE** and **averages**. The JPEGs under **`/tmp/open_loop_eval/`** overlay **predicted** vs **ground-truth** trajectories per action dimension (translation, rotation, gripper). Use them to confirm the policy tracks pick-and-place phases and gripper open/close timing on spatial tasks. > [!TIP] -> Even at 2000 steps, the fine-tuned model should show clearly improved action prediction compared to random. With 20,000 steps, LIBERO Spatial achieves 97.65% success rate in closed-loop simulation. +> At 2000 steps you should see clear improvement over a random policy; at 20,000 steps, published LIBERO Spatial success reaches **97.65%** in closed-loop sim. -## Step 7. Run inference timing benchmark +## Step 8. 
Run inference on a LIBERO sample (timing + actions) -Measure the fine-tuned model's per-step inference latency: +This step passes **LIBERO Spatial** observations through the **fine-tuned** checkpoint (the base model cannot run this embodiment). **`TORCHDYNAMO_DISABLE=1`** is included for GB300: ```bash -CUDA_VISIBLE_DEVICES=1 python scripts/deployment/standalone_inference_script.py \ +TORCHDYNAMO_DISABLE=1 CUDA_VISIBLE_DEVICES=0 python scripts/deployment/standalone_inference_script.py \ --model-path output/libero_spatial_ft/checkpoint-2000/ \ --dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \ --embodiment-tag LIBERO_PANDA \ @@ -231,21 +344,9 @@ CUDA_VISIBLE_DEVICES=1 python scripts/deployment/standalone_inference_script.py --action-horizon 8 ``` -> [!NOTE] -> If you see Triton/PTXAS errors about `sm_103a`, prepend `TORCHDYNAMO_DISABLE=1` to the command. This runs inference in eager mode. See Troubleshooting for details. +**What to inspect:** the script prints a **timing breakdown** (data processing, backbone, action head, end-to-end). Compare **MSE/MAE** and latency to Step 5's base-model smoke test. In eager mode (with `TORCHDYNAMO_DISABLE=1`), per-step latency on GB300 depends heavily on the torch + CUDA stack — expect **~3–4 s/step** on torch 2.10 + cu130 in eager mode (validated in this playbook's run on a fine-tuned `checkpoint-100`); a compiled torch 2.7 + cu128 stack with Triton support for `sm_103` can be much faster. Treat the "Backbone vs Action head" split as the more stable signal across stacks. -The timing output breaks down into: - -- **Data processing** — loading and preprocessing the observation -- **Backbone** — vision-language model forward pass -- **Action head** — diffusion transformer denoising (4 steps) -- **End-to-end** — total inference time per action chunk - -In eager mode (without `torch.compile`), expect ~240 ms per step. With `torch.compile` working, expect ~38 ms per step comparable to H100. - -## Step 8. 
Clean up - -To remove the environment: +## Step 9. Clean up ```bash deactivate @@ -253,20 +354,82 @@ cd .. rm -rf Isaac-GR00T ``` -Your fine-tuned checkpoints in `output/libero_spatial_ft/` are deleted with the repo. Copy them elsewhere first if you want to keep them. +Fine-tuned checkpoints under `output/libero_spatial_ft/` are removed with the repo. Copy them elsewhere first if you want to keep them. ## Next steps -- **Increase training steps** — Change `--max-steps` to 20000 for results closer to the published 97.65% success rate on LIBERO Spatial. Training time scales linearly (~3.5 hours at 20K steps). -- **Try other LIBERO suites** — Download `libero_10_no_noops`, `libero_goal_no_noops`, or `libero_object_no_noops` datasets from the `IPEC-COMMUNITY` HuggingFace organization and repeat the workflow. Published success rates: Object 98.45%, Goal 97.5%, 10-Long 94.35%. -- **Closed-loop simulation evaluation** — Set up the LIBERO simulation environment to test the fine-tuned model in a live control loop with the Panda robot arm. See the [LIBERO evaluation guide](https://github.com/NVIDIA/Isaac-GR00T/blob/main/examples/LIBERO/README.md#evaluate-checkpoint) for server-client setup. -- **Custom embodiments** — Fine-tune GR00T on your own robot data by following the [custom embodiment guide](https://github.com/NVIDIA/Isaac-GR00T/blob/main/getting_started/finetune_new_embodiment.md). Requires converting your data to LeRobot v2 format and defining a modality config. -- **Experiment with batch size** — The GB300's 284 GB HBM3e may support even larger batch sizes depending on which model components are being tuned. The default configuration tunes the projector and diffusion model only. Enabling `--tune-llm` or `--tune-visual` increases memory usage significantly. +- **Increase training steps** — `--max-steps 20000` for stronger LIBERO Spatial alignment (~3.5 hours at the same throughput). 
+- **Other LIBERO suites** — `libero_10_no_noops`, `libero_goal_no_noops`, `libero_object_no_noops` from **IPEC-COMMUNITY** on Hugging Face. +- **Closed-loop sim** — LIBERO sim server/client: [LIBERO evaluation in Isaac GR00T](https://github.com/NVIDIA/Isaac-GR00T/blob/n1.6-release/examples/LIBERO/README.md#evaluate-checkpoint). +- **Custom embodiments** — [Fine-tune a new embodiment](https://github.com/NVIDIA/Isaac-GR00T/blob/n1.6-release/getting_started/finetune_new_embodiment.md) (LeRobot v2 + modality JSON). +- **Tune more of the stack** — `--tune-llm` / `--tune-visual` raise memory use; probe batch size if you enable them. ## Troubleshooting ## Common Issues +### Issue: `git clone` fails or demo videos are tiny / missing (Git LFS) + +**Solution:** + +```bash +sudo apt-get install -y git-lfs +git lfs install +``` + +Remove any partial `Isaac-GR00T` directory, then clone again with `--recurse-submodules`. + +### Issue: `GR1`, `demo_data/gr1.PickNPlace`, or scripts do not match the playbook + +**Cause:** The repository default branch (**`main`**) may track a newer GR00T line (for example N1.7) with different embodiment tags and demo layouts. + +**Solution:** + +```bash +cd Isaac-GR00T +git fetch origin +git checkout n1.6-release +git submodule update --init --recursive +``` + +Always run playbook commands from **`n1.6-release`** for **N1.6** + **GR00T-N1.6-3B**. + +### Issue: `install_deps.sh` is not allowed on your machine (policy) or you need to know what it changes + +**Facts:** `scripts/deployment/dgpu/install_deps.sh` runs **`sudo apt-get`** to install **`ffmpeg`**, **`libaio-dev`**, and (on aarch64) FFmpeg **development** libraries for the **torchcodec** build. If **`/usr/local/cuda`** does not exist, it adds the NVIDIA CUDA apt repo and installs **`cuda-toolkit-12-8`**. It also installs **`uv`** into the user account if missing, then **`uv sync`** + **`uv pip install -e .`** into **`.venv`**. 
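
The changes listed above can be previewed without modifying anything. The following is a minimal read-only sketch, not part of `install_deps.sh`: the helper names `cuda_status` and `tool_status` are hypothetical, and the script does not probe `libaio-dev` or the FFmpeg development packages.

```bash
# Hypothetical read-only preflight; it installs nothing and needs no sudo.

# $1 = CUDA directory to probe (the paragraph above names /usr/local/cuda).
cuda_status() {
    if [ -e "$1" ]; then
        echo "present: $1 (no CUDA toolkit install expected)"
    else
        echo "missing: $1 (script would add the CUDA apt repo and install cuda-toolkit-12-8)"
    fi
}

# $1 = command name to look up on PATH.
tool_status() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "present: $1"
    else
        echo "missing: $1"
    fi
}

cuda_status /usr/local/cuda
tool_status ffmpeg
tool_status uv
# On aarch64 the script additionally builds torchcodec from source.
echo "machine: $(uname -m)"
```

Any `missing:` line marks a change the script would make on this host, which is the list to hand to whoever approves system-level installs.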
+ +**Solution (policy-friendly):** Pre-install the same system packages and CUDA using your IT process, ensure **`nvcc`** works, then from the repo root: + +```bash +export PATH="$HOME/.local/bin:$PATH" +uv sync +uv pip install -e . +``` + +On **aarch64**, you still need **`torchcodec`** in `.venv` or rely on the **PyAV patch** (Instructions Step 2) plus **`uv pip install av`**. + +### Issue: `uv sync` (Option B) appears stuck for hours building `flash-attn` on aarch64 + +**Cause:** Upstream `pyproject.toml` lists pre-built `flash-attn==2.7.4.post1` wheels only for `linux_x86_64`. On **aarch64** (Grace + GB300), `uv` falls back to a from-source build that compiles ~72 CUDA kernels — typically **~2 hours** end-to-end. + +**Solution:** Pin to `flash-attn==2.8.1` and use the GitHub release's prebuilt aarch64 wheel. Edit `pyproject.toml` in the repo root: + +```toml +## under [project] dependencies, replace: +## "flash-attn==2.7.4.post1", +"flash-attn==2.8.1", + +## under [tool.uv.sources], add: +flash-attn = [ + { url = "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.1/flash_attn-2.8.1+cu12torch2.10cxx11abiTRUE-cp312-cp312-linux_aarch64.whl", + marker = "sys_platform == 'linux' and platform_machine == 'aarch64' and python_version == '3.12'" }, +] +``` + +The `cu12torch2.10` aarch64 wheel works against torch 2.10 (cu128 or cu130 builds). Validated on GB300 + CUDA 13.1 — `uv sync` completes in ~1 minute instead of ~2 hours. Track upstream Isaac-GR00T for a future commit that bakes this in. + +If you must keep `flash-attn==2.7.4.post1` (Option A path), expect the 2-hour build on first sync; subsequent `uv sync` invocations re-use the cached wheel. 
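Before editing `pyproject.toml`, a quick check that the prebuilt wheel's tags line up with the active interpreter can save a failed sync. The following is a sketch under stated assumptions: `wheel_matches` is a hypothetical helper (not part of `uv` or Isaac-GR00T) that only compares the filename's CPython tag and machine suffix, so it mirrors the marker in the snippet above and does not reproduce `uv`'s actual resolution.

```bash
# Hypothetical helper (not part of uv or Isaac-GR00T): succeeds when a wheel
# filename ends with the given CPython tag and Linux machine architecture.
wheel_matches() {
  case "$1" in
    *-"$2"-"$2"-linux_"$3".whl) return 0 ;;
    *) return 1 ;;
  esac
}

# Wheel filename taken from the release URL in the snippet above.
WHEEL="flash_attn-2.8.1+cu12torch2.10cxx11abiTRUE-cp312-cp312-linux_aarch64.whl"

# Derive the interpreter tag (for example cp312) and the machine name.
PYTAG="cp$(python3 -c 'import sys; print("%d%d" % sys.version_info[:2])')"
MACHINE="$(uname -m)"

if wheel_matches "$WHEEL" "$PYTAG" "$MACHINE"; then
  echo "match: $PYTAG on $MACHINE"
else
  echo "no match: $PYTAG on $MACHINE (wheel is cp312 / aarch64)"
fi
```

Run it with `.venv` activated so that `python3` is the interpreter `uv` will install into. A "no match" result suggests the aarch64 and Python 3.12 marker would not select this wheel on your machine.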
+ ### Issue: `install_deps.sh` fails building torchcodec **Solution:** @@ -277,19 +440,20 @@ Ensure the license confirmation env var is set: I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 bash scripts/deployment/dgpu/install_deps.sh ``` -If the build still fails, ensure FFmpeg dev libraries are installed: +If the build still fails, install FFmpeg development libraries: ```bash sudo apt-get install -y libavdevice-dev libavfilter-dev libavformat-dev \ - libavcodec-dev libavutil-dev libswresample-dev libswscale-dev + libavcodec-dev libavutil-dev libswresample-dev libswscale-dev \ + pkg-config cmake build-essential pybind11-dev ``` +Then apply **Instructions Step 2** (PyAV patch) so training does not depend on a working **torchcodec** for indexed frame reads. + ### Issue: `huggingface-cli download` fails with 401 Unauthorized **Solution:** -Verify your HuggingFace token is set and valid: - ```bash echo $HF_TOKEN huggingface-cli whoami @@ -301,116 +465,150 @@ If the token is not set: export HF_TOKEN="your_token_here" ``` -Make sure you have accepted any required model agreements on the HuggingFace model page. +Accept any required license or gated-model agreements on the Hugging Face model page. + +### Issue: `huggingface-cli download` fails with `Permission denied: '/home/.../.cache/huggingface/hub/...'` + +**Cause:** The shared cache directory was previously created by a Docker container running as **root** (common on multi-user dev boxes that mount `~/.cache/huggingface` into containers without `--user`). The current user (`nvidia`) cannot write into it. + +**Solution:** point HF at a user-owned cache location for this run: + +```bash +export HF_HOME=$HOME/hf_cache_gr00t +mkdir -p "$HF_HOME" +huggingface-cli download nvidia/GR00T-N1.6-3B +``` + +Re-export `HF_HOME` for the rest of the playbook (Step 5 onward) so model loads find the right cache. 
To permanently un-stick the original cache, ask whoever owns the container session to chown `~/.cache/huggingface` back to your user. + +### Issue: `huggingface-cli download` returns `500 Internal Server Error` from the `xet-read-token` endpoint + +**Cause:** Hugging Face's xet content-addressable backend occasionally returns transient `5xx`. This blocks dataset downloads even though the underlying files are reachable via the legacy backend. + +**Solution:** disable xet for the download: + +```bash +export HF_HUB_DISABLE_XET=1 +huggingface-cli download --repo-type dataset \ + IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot \ + --local-dir examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ +``` + +### Issue: `externally-managed-environment` or `pip` installs not going into `.venv` + +**Cause:** Debian/Ubuntu **PEP 668** blocks `pip install` onto the system Python. Mixing **`sudo pip`** with the project venv breaks the playbook. + +**Solution:** + +1. **`source .venv/bin/activate`** — prompt should show `(.venv)`. +2. Use **`uv pip install ...`** (or **`python -m pip install ...`**) **only** with the venv activated — never `sudo pip` for this project. +3. If the venv was created with a broken `pip`, recreate: `rm -rf .venv` and run **`uv sync`** again from the repo root (after `n1.6-release` checkout). ### Issue: CUDA out of memory during fine-tuning **Solution:** -If fine-tuning fails with an OOM error at batch size 128, reduce the batch size: +Reduce batch size: ```bash --global-batch-size 64 ``` -Also check that no other processes are using GPU memory: +Check for other GPU processes: `nvidia-smi`. **`--tune-llm`** / **`--tune-visual`** increase memory use substantially. -```bash -nvidia-smi -``` +### Issue: Triton / PTXAS errors about `sm_103a` (GB300 / Blackwell) -If you are tuning additional model components (`--tune-llm` or `--tune-visual`), these significantly increase memory usage. 
The default configuration (projector + diffusion model only) is the most memory-efficient. +**Symptom:** -### Issue: Triton/PTXAS errors about `sm_103a` during inference - -**Solution:** - -The bundled Triton version may not yet support SM103 (GB300). This causes errors like: - -``` +```text ptxas-blackwell fatal: Value 'sm_103a' is not defined for option 'gpu-name' ``` -Disable `torch.compile` by prepending: +**Solution:** + +For **`scripts/deployment/standalone_inference_script.py`** (which may use **`torch.compile`**), prepend: ```bash TORCHDYNAMO_DISABLE=1 python scripts/deployment/standalone_inference_script.py ... ``` -This runs inference in eager mode (~240 ms/step instead of ~38 ms/step with compile). Training and open-loop evaluation are not affected since they use eager mode by default. +This forces eager inference (higher latency per step but stable on SM103 until Triton catches up). Fine-tuning and **`open_loop_eval.py`** typically run without this compile path; use the same prefix there **only** if you see the same crash. ### Issue: `ModuleNotFoundError: No module named 'gr00t'` **Solution:** -The virtual environment is not activated. Run: - ```bash source .venv/bin/activate +pwd # .../Isaac-GR00T ``` -Verify you are in the Isaac-GR00T directory: +### Issue: `NotImplementedError` in `get_frames_by_indices` when backend is `pyav` + +**Cause:** On **`n1.6-release`**, **`resolve_backend`** can select **`pyav`**, but stock **`get_frames_by_indices`** did not implement the **`pyav`** branch. + +**Solution:** Apply the playbook patch and install PyAV (see **Instructions Step 2** and `assets/patches/README.md`). + +### Issue: Training “hangs” — low GPU utilization, no traceback, very slow steps + +**Cause:** Fallback to **per-frame `ffmpeg` subprocess** decoding for **AV1** LIBERO clips; dataloaders starve the GPU. + +**Solution:** + +1. Apply the **PyAV patch** (Step 2) and **`uv pip install av`**. +2. 
Optionally increase **`--dataloader-num-workers`** (for example **8**) if CPUs are free. + +**Expected noise after patching:** logs may repeat `Video backend 'torchcodec' is not available, falling back to 'pyav'` — that is normal if **torchcodec** is absent. + +### Issue: Video decoding errors / `torchcodec` not found (general) + +**Solution:** + +Prefer the **PyAV patch + `av`** path above for LIBERO on GB300. + +If you must build **torchcodec** into `.venv` manually (aarch64), with FFmpeg dev packages installed: ```bash -pwd -## Should show: .../Isaac-GR00T +## Run this from inside the Isaac-GR00T repo root (the directory that +## contains .venv). Capture its absolute path BEFORE changing directories +## so we can still reach the virtualenv after cd'ing into /tmp/torchcodec. +GR00T_ROOT="$(pwd)" + +## Sanity check — the virtualenv interpreter must already exist. +test -x "$GR00T_ROOT/.venv/bin/python" || { echo "Not in Isaac-GR00T root (missing .venv/bin/python)"; } + +## Clone the torchcodec source into /tmp/torchcodec (skip if already cloned). +git clone https://github.com/pytorch/torchcodec.git /tmp/torchcodec +cd /tmp/torchcodec + +## Build torchcodec into the Isaac-GR00T virtualenv using the absolute +## path captured above (do NOT use the relative ".venv/bin/python" here — +## the current directory is /tmp/torchcodec, which has no .venv). +I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 ENABLE_CUDA=1 \ + uv pip install --python "$GR00T_ROOT/.venv/bin/python" . --no-build-isolation ``` +CUDA-enabled builds can fail when system FFmpeg or CUDA does not match torchcodec expectations — in that case use the **PyAV patch** instead. + ### Issue: Training loss is not decreasing **Solution:** -At 2000 steps, the model may not have converged fully — this is expected for the shortened playbook run. If loss remains flat after 500+ steps: +At 2000 steps the model may still be early. If loss is flat after many steps: -1. 
Verify the dataset was downloaded correctly and the modality config was copied: - ```bash - ls examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/modality.json - ``` - -2. Check that the correct embodiment tag is used (`LIBERO_PANDA`, not `NEW_EMBODIMENT`). - -3. Try increasing the learning rate to `5e-4` for faster initial convergence on short runs. +1. Verify modality file: `ls examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/modality.json` +2. Confirm **`--embodiment-tag LIBERO_PANDA`** +3. Try **`--learning-rate 5e-4`** for faster early movement on short runs ### Issue: `nvidia-smi` shows the wrong GPU **Solution:** -On DGX Station, the GB300 may not be device 0. Find the correct index: - ```bash nvidia-smi --query-gpu=index,name --format=csv,noheader +CUDA_VISIBLE_DEVICES=<index> python ... ``` -Use the GB300's index with `CUDA_VISIBLE_DEVICES`: +### Issue: OpenCV or decord cannot decode LIBERO AV1 -```bash -CUDA_VISIBLE_DEVICES=1 python ... -``` - -### Issue: Slow data loading during training - -**Solution:** - -Increase the number of dataloader workers: - -```bash ---dataloader-num-workers 8 -``` - -### Issue: Video decoding errors (`NotImplementedError` or torchcodec not found) - -**Solution:** - -The `install_deps.sh` script builds torchcodec from source on aarch64. If it wasn't built correctly, reinstall: - -```bash -sudo apt-get install -y libavdevice-dev libavfilter-dev libavformat-dev \ - libavcodec-dev libavutil-dev libswresample-dev libswscale-dev \ - pkg-config cmake build-essential pybind11-dev - -git clone --depth 1 --branch release/0.4 https://github.com/meta-pytorch/torchcodec.git /tmp/torchcodec -cd /tmp/torchcodec -I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 ENABLE_CUDA=1 \ - uv pip install --python .venv/bin/python . --no-build-isolation -cd - && rm -rf /tmp/torchcodec -``` +**Notes:** **OpenCV** often fails on **AV1** in LIBERO assets. **decord** may lack a compatible wheel for your platform.
The **PyAV** patch path is the supported mitigation in this playbook. diff --git a/nvidia/station-gr00t/endpoint-production.yaml b/nvidia/station-gr00t/endpoint-production.yaml new file mode 100644 index 0000000..3592da4 --- /dev/null +++ b/nvidia/station-gr00t/endpoint-production.yaml @@ -0,0 +1,472 @@ +kind: Playbook +metadata: + name: station-gr00t + displayName: Isaac GR00T N1.6 Fine-Tuning + shortDescription: Fine-tune and benchmark NVIDIA's GR00T N1.6 robotics foundation model on DGX Station + + publisher: nvidia + description: | + # REPLACE THIS WITH YOUR MODEL CARD + https://gitlab-master.nvidia.com/api-catalog/examples/-/blob/main/modelcard-example-mixtral8x7b.md?ref_type=heads + + labelsV2: + - gpuType:playbook:gpu_type_station + - DGX Station + - GB300 + - Robotics + - Isaac GR00T + - Fine-Tuning + - Blackwell + - VLA + + attributes: + - key: DURATION + value: 45 MIN + +spec: + artifactName: station-gr00t + nvcfFunctionId: None + attributes: + + showUnavailableBanner: false + apiDocsUrl: None + termsOfUse: | + + cta: + text: View on GitHub + url: https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/station-gr00t/ + + + tabs: + - + id: overview + + label: Overview + content: | + # Basic idea + + NVIDIA Isaac GR00T N1.6 is a 3-billion-parameter open vision-language-action (VLA) foundation model for generalist humanoid robot skills. It combines a Cosmos-Reason-2B vision-language backbone with a 32-layer Diffusion Transformer (DiT) action head that denoises continuous robot actions from multimodal input — language instructions and camera images. The model is pre-trained on over 10,000 hours of robot demonstration data spanning bimanual arms, semi-humanoid platforms, and full humanoids, then adapted to specific embodiments and tasks through fine-tuning. + + In this playbook you will fine-tune GR00T N1.6 on the LIBERO Spatial benchmark — a manipulation task suite that tests spatial reasoning with a Panda robot arm. 
DGX Station's GB300 GPU with 284 GB of HBM3e memory enables a per-device batch size of **128**, far exceeding the typical 32–64 used on smaller GPUs, which accelerates convergence and improves training throughput. + + # What you'll accomplish + + - Set up the Isaac GR00T environment using `uv` (fast, reproducible Python packaging) + - Verify the pre-trained base model loads and runs inference + - Fine-tune GR00T N1.6 on the LIBERO Spatial dataset with batch size 128 + - Evaluate the fine-tuned model using open-loop evaluation and measure inference latency + + # What to know before starting + + - Familiarity with Python virtual environments + - Familiarity with PyTorch training workflows (epochs, batch size, loss curves) + - General understanding of robot manipulation concepts (actions, observations, trajectories) + + # Prerequisites + + - NVIDIA DGX Station with GB300 GPU (Blackwell SM103, 284 GB HBM3e) + - CUDA toolkit installed: `nvcc --version` should show CUDA 12.8+ + - Git installed: `git --version` + - HuggingFace account with access token (for model and dataset downloads) + - Network access to HuggingFace, GitHub, and PyPI + - At least 30 GB of free disk space (venv + model + dataset) + + # Time & risk + + * **Duration:** ~45 minutes (5 min setup, 5 min dataset download, 25 min fine-tuning at 2000 steps, 5 min evaluation) + * **Risks:** Model download requires HuggingFace authentication; `uv sync` installs packages into a project-local `.venv` + * **Rollback:** Delete the cloned `Isaac-GR00T` directory to restore state. No system-level changes are made. + * **Last Updated:** 04/06/2026 + * First Publication + + + + - + id: instructions + + label: Instructions + content: | + # Step 1. Clone Isaac GR00T and install dependencies + + Clone the repository and run the dGPU install script. 
This uses `uv` for fast, reproducible dependency management and automatically detects the aarch64 architecture: + + ```bash + git clone --recurse-submodules https://github.com/NVIDIA/Isaac-GR00T + cd Isaac-GR00T + I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 bash scripts/deployment/dgpu/install_deps.sh + ``` + + The install script: + - Installs system dependencies (`ffmpeg`, `libaio-dev`) + - Installs `uv` if not present + - Runs `uv sync` to create a `.venv` with all Python dependencies (PyTorch 2.7.1 + CUDA 12.8) + - Builds `torchcodec` from source on aarch64 (required for video decoding) + + Activate the virtual environment: + + ```bash + source .venv/bin/activate + ``` + + Verify GPU access: + + ```bash + CUDA_VISIBLE_DEVICES=1 python -c "import torch; print(torch.cuda.get_device_name(0))" + ``` + + Expected output: `NVIDIA GB300` + + > [!NOTE] + > Replace `CUDA_VISIBLE_DEVICES=1` with the index of your GB300 GPU throughout this playbook. Run `nvidia-smi --query-gpu=index,name --format=csv,noheader` to find it. + + # Step 2. Set up HuggingFace authentication + + ```bash + export HF_TOKEN="your_huggingface_token" + ``` + + Get a token from https://huggingface.co/settings/tokens if you don't have one. + + # Step 3. Download the dataset and model + + Download the LIBERO Spatial dataset and the GR00T N1.6 base model: + + ```bash + # Download LIBERO Spatial dataset (~2-3 GB) + huggingface-cli download \ + --repo-type dataset IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot \ + --local-dir examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ + + # Copy the LIBERO modality config into the dataset's meta/ directory + cp examples/LIBERO/modality.json \ + examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/ + + # Download GR00T N1.6 base model (~6 GB) + huggingface-cli download nvidia/GR00T-N1.6-3B + ``` + + Verify the dataset is ready: + + ```bash + ls examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/modality.json + ``` + + # Step 4. 
Verify the base model loads and runs + + Confirm the GR00T N1.6 base model loads correctly and can produce actions. The base model ships with a GR1 demo dataset for quick verification: + + ```bash + CUDA_VISIBLE_DEVICES=1 python scripts/deployment/standalone_inference_script.py \ + --model-path nvidia/GR00T-N1.6-3B \ + --dataset-path demo_data/gr1.PickNPlace \ + --embodiment-tag GR1 \ + --traj-ids 0 \ + --inference-mode pytorch \ + --action-horizon 8 \ + --steps 32 + ``` + + You should see per-step timing output and no errors. This confirms the model, CUDA, and data pipeline all work before committing to a longer fine-tuning run. + + > [!NOTE] + > If you see Triton/PTXAS errors about `sm_103a`, prepend `TORCHDYNAMO_DISABLE=1` to the command. See Troubleshooting for details. + + > [!NOTE] + > The base model's pretrained processor does not include the `LIBERO_PANDA` embodiment configuration, so you cannot run evaluation directly on the LIBERO dataset with the base model. The LIBERO modality config is registered during fine-tuning. This is expected — LIBERO is a post-training benchmark. + + # Step 5. Fine-tune GR00T N1.6 on LIBERO Spatial + + Fine-tune the base model on the LIBERO Spatial dataset. DGX Station's GB300 GPU with 284 GB HBM3e lets you use a batch size of **128** — roughly 4x what fits on a typical 80 GB GPU. Larger batch sizes mean more stable gradients and faster convergence per wall-clock hour. 
+ + ```bash + CUDA_VISIBLE_DEVICES=1 python \ + gr00t/experiment/launch_finetune.py \ + --base-model-path nvidia/GR00T-N1.6-3B \ + --dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \ + --embodiment-tag LIBERO_PANDA \ + --num-gpus 1 \ + --output-dir output/libero_spatial_ft \ + --save-steps 500 \ + --save-total-limit 5 \ + --max-steps 2000 \ + --global-batch-size 128 \ + --learning-rate 1e-4 \ + --warmup-ratio 0.05 \ + --weight-decay 1e-5 \ + --state-dropout-prob 0.8 \ + --color-jitter-params brightness 0.3 contrast 0.4 saturation 0.5 hue 0.08 \ + --dataloader-num-workers 4 + ``` + + Training runs for **2000 steps** at batch size 128 and takes approximately 20–25 minutes on the GB300. + + > [!NOTE] + > This playbook uses 2000 steps to keep execution time under an hour. For production-quality results matching the published 97.65% success rate on LIBERO Spatial, increase to **20,000 steps** by changing `--max-steps 20000`. Published results used batch size 640 across 8 GPUs (80 per GPU) — batch size 128 on a single GB300 exceeds the per-GPU batch size used in the published benchmarks. + + **What the training flags mean:** + + | Flag | Value | Purpose | + |------|-------|---------| + | `--global-batch-size` | 128 | Total samples per training step. GB300's 284 GB HBM3e makes this possible on a single GPU. | + | `--state-dropout-prob` | 0.8 | Drops state input 80% of the time during training, forcing the model to rely on vision. Improves generalization. | + | `--color-jitter-params` | brightness/contrast/saturation/hue | Randomly perturbs image colors during training for robustness to lighting variation. | + | `--warmup-ratio` | 0.05 | Linearly ramps learning rate from 0 to 1e-4 over the first 5% of steps (100 steps). | + | `--save-steps` | 500 | Saves a checkpoint every 500 steps. | + + Monitor the training loss in the terminal. The HuggingFace Trainer logs progress at each step — look for the `loss` field decreasing over time. 
Checkpoints are saved every 500 steps to `output/libero_spatial_ft/`. + + # Step 6. Evaluate the fine-tuned model + + Run open-loop evaluation on the fine-tuned checkpoint. This compares the model's predicted actions against the ground truth from the dataset: + + ```bash + CUDA_VISIBLE_DEVICES=1 python gr00t/eval/open_loop_eval.py \ + --dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \ + --embodiment-tag LIBERO_PANDA \ + --model-path output/libero_spatial_ft/checkpoint-2000/ \ + --traj-ids 0 1 2 \ + --action-horizon 16 + ``` + + The evaluation outputs: + + - **Per-trajectory MSE and MAE** printed to the terminal + - **Average MSE** across all evaluated trajectories + - **JPEG visualizations** saved to `/tmp/open_loop_eval/` showing ground truth vs. predicted actions for each action dimension (x, y, z, roll, pitch, yaw, gripper) + + Key things to look for in the plots: + + - **Predicted trajectories** (orange line) should closely track the **ground truth** (blue line) + - **Gripper timing** — opening and closing at the correct moments + - **Lower MSE** indicates better action prediction accuracy + + > [!TIP] + > Even at 2000 steps, the fine-tuned model should show clearly improved action prediction compared to random. With 20,000 steps, LIBERO Spatial achieves 97.65% success rate in closed-loop simulation. + + # Step 7. Run inference timing benchmark + + Measure the fine-tuned model's per-step inference latency: + + ```bash + CUDA_VISIBLE_DEVICES=1 python scripts/deployment/standalone_inference_script.py \ + --model-path output/libero_spatial_ft/checkpoint-2000/ \ + --dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \ + --embodiment-tag LIBERO_PANDA \ + --traj-ids 0 \ + --inference-mode pytorch \ + --action-horizon 8 + ``` + + > [!NOTE] + > If you see Triton/PTXAS errors about `sm_103a`, prepend `TORCHDYNAMO_DISABLE=1` to the command. This runs inference in eager mode. See Troubleshooting for details. 
+ + The timing output breaks down into: + + - **Data processing** — loading and preprocessing the observation + - **Backbone** — vision-language model forward pass + - **Action head** — diffusion transformer denoising (4 steps) + - **End-to-end** — total inference time per action chunk + + In eager mode (without `torch.compile`), expect ~240 ms per step. With `torch.compile` working, expect ~38 ms per step comparable to H100. + + # Step 8. Clean up + + To remove the environment: + + ```bash + deactivate + cd .. + rm -rf Isaac-GR00T + ``` + + Your fine-tuned checkpoints in `output/libero_spatial_ft/` are deleted with the repo. Copy them elsewhere first if you want to keep them. + + # Next steps + + - **Increase training steps** — Change `--max-steps` to 20000 for results closer to the published 97.65% success rate on LIBERO Spatial. Training time scales linearly (~3.5 hours at 20K steps). + - **Try other LIBERO suites** — Download `libero_10_no_noops`, `libero_goal_no_noops`, or `libero_object_no_noops` datasets from the `IPEC-COMMUNITY` HuggingFace organization and repeat the workflow. Published success rates: Object 98.45%, Goal 97.5%, 10-Long 94.35%. + - **Closed-loop simulation evaluation** — Set up the LIBERO simulation environment to test the fine-tuned model in a live control loop with the Panda robot arm. See the [LIBERO evaluation guide](https://github.com/NVIDIA/Isaac-GR00T/blob/main/examples/LIBERO/README.md#evaluate-checkpoint) for server-client setup. + - **Custom embodiments** — Fine-tune GR00T on your own robot data by following the [custom embodiment guide](https://github.com/NVIDIA/Isaac-GR00T/blob/main/getting_started/finetune_new_embodiment.md). Requires converting your data to LeRobot v2 format and defining a modality config. + - **Experiment with batch size** — The GB300's 284 GB HBM3e may support even larger batch sizes depending on which model components are being tuned. The default configuration tunes the projector and diffusion model only. 
Enabling `--tune-llm` or `--tune-visual` increases memory usage significantly. + + + + - + id: troubleshooting + + label: Troubleshooting + content: | + # Common Issues + + ## Issue: `install_deps.sh` fails building torchcodec + + **Solution:** + + Ensure the license confirmation env var is set: + + ```bash + I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 bash scripts/deployment/dgpu/install_deps.sh + ``` + + If the build still fails, ensure FFmpeg dev libraries are installed: + + ```bash + sudo apt-get install -y libavdevice-dev libavfilter-dev libavformat-dev \ + libavcodec-dev libavutil-dev libswresample-dev libswscale-dev + ``` + + ## Issue: `huggingface-cli download` fails with 401 Unauthorized + + **Solution:** + + Verify your HuggingFace token is set and valid: + + ```bash + echo $HF_TOKEN + huggingface-cli whoami + ``` + + If the token is not set: + + ```bash + export HF_TOKEN="your_token_here" + ``` + + Make sure you have accepted any required model agreements on the HuggingFace model page. + + ## Issue: CUDA out of memory during fine-tuning + + **Solution:** + + If fine-tuning fails with an OOM error at batch size 128, reduce the batch size: + + ```bash + --global-batch-size 64 + ``` + + Also check that no other processes are using GPU memory: + + ```bash + nvidia-smi + ``` + + If you are tuning additional model components (`--tune-llm` or `--tune-visual`), these significantly increase memory usage. The default configuration (projector + diffusion model only) is the most memory-efficient. + + ## Issue: Triton/PTXAS errors about `sm_103a` during inference + + **Solution:** + + The bundled Triton version may not yet support SM103 (GB300). This causes errors like: + + ``` + ptxas-blackwell fatal: Value 'sm_103a' is not defined for option 'gpu-name' + ``` + + Disable `torch.compile` by prepending: + + ```bash + TORCHDYNAMO_DISABLE=1 python scripts/deployment/standalone_inference_script.py ... 
+ ``` + + This runs inference in eager mode (~240 ms/step instead of ~38 ms/step with compile). Training and open-loop evaluation are not affected since they use eager mode by default. + + ## Issue: `ModuleNotFoundError: No module named 'gr00t'` + + **Solution:** + + The virtual environment is not activated. Run: + + ```bash + source .venv/bin/activate + ``` + + Verify you are in the Isaac-GR00T directory: + + ```bash + pwd + # Should show: .../Isaac-GR00T + ``` + + ## Issue: Training loss is not decreasing + + **Solution:** + + At 2000 steps, the model may not have converged fully — this is expected for the shortened playbook run. If loss remains flat after 500+ steps: + + 1. Verify the dataset was downloaded correctly and the modality config was copied: + ```bash + ls examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/modality.json + ``` + + 2. Check that the correct embodiment tag is used (`LIBERO_PANDA`, not `NEW_EMBODIMENT`). + + 3. Try increasing the learning rate to `5e-4` for faster initial convergence on short runs. + + ## Issue: `nvidia-smi` shows the wrong GPU + + **Solution:** + + On DGX Station, the GB300 may not be device 0. Find the correct index: + + ```bash + nvidia-smi --query-gpu=index,name --format=csv,noheader + ``` + + Use the GB300's index with `CUDA_VISIBLE_DEVICES`: + + ```bash + CUDA_VISIBLE_DEVICES=1 python ... + ``` + + ## Issue: Slow data loading during training + + **Solution:** + + Increase the number of dataloader workers: + + ```bash + --dataloader-num-workers 8 + ``` + + ## Issue: Video decoding errors (`NotImplementedError` or torchcodec not found) + + **Solution:** + + The `install_deps.sh` script builds torchcodec from source on aarch64. 
If it wasn't built correctly, reinstall: + + ```bash + sudo apt-get install -y libavdevice-dev libavfilter-dev libavformat-dev \ + libavcodec-dev libavutil-dev libswresample-dev libswscale-dev \ + pkg-config cmake build-essential pybind11-dev + + git clone --depth 1 --branch release/0.4 https://github.com/meta-pytorch/torchcodec.git /tmp/torchcodec + cd /tmp/torchcodec + I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 ENABLE_CUDA=1 \ + uv pip install --python .venv/bin/python . --no-build-isolation + cd - && rm -rf /tmp/torchcodec + ``` + + + + + resources: + - name: Isaac GR00T (GitHub) + url: https://github.com/NVIDIA/Isaac-GR00T + + + - name: GR00T N1.6 Model (HuggingFace) + url: https://huggingface.co/nvidia/GR00T-N1.6-3B + + + - name: GR00T N1.6 Research Blog + url: https://research.nvidia.com/labs/gear/gr00t-n1_6/ + + + - name: GR00T N1.6 Paper + url: https://arxiv.org/abs/2503.14734 + + + - name: LIBERO Benchmark + url: https://libero-project.github.io/main.html + + diff --git a/nvidia/station-gr00t/endpoint-test.yaml b/nvidia/station-gr00t/endpoint-test.yaml index 3592da4..607a0cf 100644 --- a/nvidia/station-gr00t/endpoint-test.yaml +++ b/nvidia/station-gr00t/endpoint-test.yaml @@ -45,38 +45,57 @@ spec: content: | # Basic idea - NVIDIA Isaac GR00T N1.6 is a 3-billion-parameter open vision-language-action (VLA) foundation model for generalist humanoid robot skills. It combines a Cosmos-Reason-2B vision-language backbone with a 32-layer Diffusion Transformer (DiT) action head that denoises continuous robot actions from multimodal input — language instructions and camera images. The model is pre-trained on over 10,000 hours of robot demonstration data spanning bimanual arms, semi-humanoid platforms, and full humanoids, then adapted to specific embodiments and tasks through fine-tuning. + NVIDIA Isaac GR00T N1.6 is a 3-billion-parameter open vision-language-action (VLA) foundation model for generalist humanoid robot skills. 
It combines a Cosmos-family vision-language backbone with a 32-layer Diffusion Transformer (DiT) action head that denoises continuous robot actions from multimodal input — language instructions and camera images. The model is pre-trained on a large mixture of robot demonstration data, then adapted to specific embodiments and tasks through fine-tuning. - In this playbook you will fine-tune GR00T N1.6 on the LIBERO Spatial benchmark — a manipulation task suite that tests spatial reasoning with a Panda robot arm. DGX Station's GB300 GPU with 284 GB of HBM3e memory enables a per-device batch size of **128**, far exceeding the typical 32–64 used on smaller GPUs, which accelerates convergence and improves training throughput. + High-level architecture (VLM + DiT action head), as in the upstream Isaac GR00T repo: + + ![GR00T N1.6 reference architecture](./assets/GR00T-reference-arch-diagram.png) + + *Source: [NVIDIA Isaac GR00T — `media/GR00T-reference-arch-diagram.png`](https://github.com/NVIDIA/Isaac-GR00T/blob/n1.6-release/media/GR00T-reference-arch-diagram.png). If the local image above is missing, the upstream copy is at `https://raw.githubusercontent.com/NVIDIA/Isaac-GR00T/n1.6-release/media/GR00T-reference-arch-diagram.png`.* + + In this playbook you will fine-tune GR00T N1.6 on the **LIBERO Spatial** benchmark on a **DGX Station** with **GB300** (large unified memory). That setup supports a high **global batch size (128)** on a single GPU, which improves training throughput compared to typical 24–80 GB consumer or datacenter GPUs. + + # LIBERO Spatial (what you are fine-tuning on) + + **LIBERO Spatial** is part of the [LIBERO](https://libero-project.github.io/main.html) suite of simulated tabletop manipulation benchmarks. The **spatial** split emphasizes **where** objects need to be placed: tasks such as putting a bowl on a **stove burner** vs a **plate**, placing utensils in a **mug** vs next to it, or moving objects to **left/right/front** targets on the table. 
Episodes include third-person RGB video, proprioceptive state, language instructions, and continuous end-effector actions in a consistent LeRobot v2 layout. Understanding these constraints helps when you read training logs or open-loop evaluation plots. + + # What kind of fine-tuning this playbook uses + + This playbook runs the **default Isaac GR00T fine-tuning recipe** from `launch_finetune.py`: **not** full-model weight updates of the entire 3B VLM. In the stock configuration, training focuses on the **action head (DiT)** and **projector / adapter paths** that map observations into the action model, with strong **state dropout** and **color jitter** so the policy leans on vision. Optional flags such as `--tune-llm` or `--tune-visual` (mentioned under Next steps) trade compute and memory for updating more of the backbone. **LoRA** is not the default here; if your team uses LoRA or other PEFT variants, treat that as a separate configuration branch from this playbook. + + # NVIDIA DGX Station (why this hardware) + + **DGX Station** is a deskside AI system built for **large-memory GPU** training and inference (this playbook targets **GB300** with **284 GB HBM3e**). Beyond robotics, the same class of machine supports **large-model fine-tuning**, **RAG serving**, **multi-modal training**, and **CUDA research** where single-GPU memory and bandwidth dominate. For GR00T, the headline benefit is fitting **much larger batch sizes** per GPU than on smaller cards, which stabilizes gradients and improves **samples per second** when the data pipeline keeps up. 
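To make the batch-size comparison concrete, the arithmetic can be sketched as below. The numbers are the ones this playbook quotes elsewhere (global batch 128 on one GB300; the published reference run used 640 across 8 GPUs). The helper names are hypothetical, and the step time passed in is a placeholder to replace with a value measured in your own run.

```python
def per_gpu_batch(global_batch_size, num_gpus):
    """Samples each GPU processes per optimizer step."""
    if global_batch_size % num_gpus != 0:
        raise ValueError("global batch size must divide evenly across GPUs")
    return global_batch_size // num_gpus


def samples_per_second(global_batch_size, seconds_per_step):
    """Training throughput for a measured wall-clock step time."""
    return global_batch_size / seconds_per_step


if __name__ == "__main__":
    print("GB300, 1 GPU:", per_gpu_batch(128, 1))       # 128 per GPU
    print("Published, 8 GPUs:", per_gpu_batch(640, 8))  # 80 per GPU
    # Placeholder step time; substitute what your run actually logs.
    print("Throughput at 1.0 s/step:", samples_per_second(128, 1.0))
```

Throughput only benefits from the larger batch when the step time does not grow in proportion, which is why the dataloader caveat matters.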
# What you'll accomplish - - Set up the Isaac GR00T environment using `uv` (fast, reproducible Python packaging) - - Verify the pre-trained base model loads and runs inference - - Fine-tune GR00T N1.6 on the LIBERO Spatial dataset with batch size 128 - - Evaluate the fine-tuned model using open-loop evaluation and measure inference latency + - Check out the **`n1.6-release`** branch of Isaac GR00T so commands, embodiment tags, and `demo_data/` match GR00T **N1.6** + - Set up the environment with `uv` (project-local `.venv`) and understand what the optional `install_deps.sh` script changes on the system + - Apply the recommended **PyAV `get_frames_by_indices` patch** when `torchcodec` is unavailable so LIBERO **AV1** video decoding does not stall on an **ffmpeg** subprocess fallback + - Verify the base model, fine-tune on LIBERO Spatial at batch size **128**, run open-loop evaluation, and measure inference latency (with **GB300 / Blackwell** TorchDynamo compilation notes) # What to know before starting - - Familiarity with Python virtual environments - - Familiarity with PyTorch training workflows (epochs, batch size, loss curves) - - General understanding of robot manipulation concepts (actions, observations, trajectories) + - Familiarity with Python virtual environments (`source .venv/bin/activate`) + - Familiarity with PyTorch training concepts (batch size, loss, checkpoints) + - Basic robot manipulation vocabulary (trajectories, observations, actions) + - Comfort running commands that may use **`sudo`** for system packages (or use the documented user-space alternative) # Prerequisites - - NVIDIA DGX Station with GB300 GPU (Blackwell SM103, 284 GB HBM3e) - - CUDA toolkit installed: `nvcc --version` should show CUDA 12.8+ - - Git installed: `git --version` - - HuggingFace account with access token (for model and dataset downloads) - - Network access to HuggingFace, GitHub, and PyPI - - At least 30 GB of free disk space (venv + model + dataset) + - NVIDIA **DGX 
Station** with **GB300** (Blackwell SM103, 284 GB HBM3e) + - CUDA toolkit usable by PyTorch: `nvcc --version` should show **CUDA 12.8+** (often already under `/usr/local/cuda` on DGX images) + - **Git** and **Git LFS** (`git lfs version`) — LFS is required for some demo assets and submodules; install with `sudo apt-get install -y git-lfs` then `git lfs install` if missing + - Hugging Face account and **HF_TOKEN** for model and dataset downloads + - Network access to Hugging Face, GitHub, and PyPI + - At least **~30 GB** free disk for `.venv`, checkpoints, and the LIBERO download # Time & risk - * **Duration:** ~45 minutes (5 min setup, 5 min dataset download, 25 min fine-tuning at 2000 steps, 5 min evaluation) - * **Risks:** Model download requires HuggingFace authentication; `uv sync` installs packages into a project-local `.venv` - * **Rollback:** Delete the cloned `Isaac-GR00T` directory to restore state. No system-level changes are made. - * **Last Updated:** 04/06/2026 + * **Duration:** ~45 minutes end-to-end when the video backend is healthy (setup, downloads, ~20–25 min training at 2000 steps, eval and inference) + * **Risks:** `scripts/deployment/dgpu/install_deps.sh` performs **system-level** `apt` operations and may install the **CUDA 12.8 toolkit** if `/usr/local/cuda` is absent (see Instructions). Model download requires Hugging Face authentication. + * **Rollback:** Remove the cloned `Isaac-GR00T` directory and optionally `rm -rf ~/.local/share/uv` if you want to reclaim `uv` caches. Reverting `apt`-installed packages is a separate admin task; the playbook does not uninstall them automatically. + * **Last Updated:** 05/26/2026 * First Publication @@ -88,19 +107,75 @@ spec: content: | # Step 1. Clone Isaac GR00T and install dependencies - Clone the repository and run the dGPU install script. This uses `uv` for fast, reproducible dependency management and automatically detects the aarch64 architecture: + ## 1a. 
Git LFS (required for a clean clone) + + If `git clone` fails with errors about **Git LFS** or missing pointer files, install and initialize LFS, then remove any partial `Isaac-GR00T` directory and clone again: + + ```bash + sudo apt-get update + sudo apt-get install -y git-lfs + git lfs install + ``` + + ## 1b. Clone and check out `n1.6-release` + + The **`main`** branch tracks ongoing development (for example newer GR00T milestones) and **does not** always match this **N1.6** playbook. Embodiment tags such as **`GR1`**, paths like **`demo_data/gr1.PickNPlace`**, and tutorial scripts are aligned with the **`n1.6-release`** branch. ```bash git clone --recurse-submodules https://github.com/NVIDIA/Isaac-GR00T cd Isaac-GR00T + git fetch origin + git checkout n1.6-release + git submodule update --init --recursive + ``` + + ## 1c. Install Python dependencies + + ### Option A — `install_deps.sh` (matches upstream docs; uses `sudo`) + + This script is the supported path. It may make **system-level** changes: + + - Runs `apt-get update` and installs **`ffmpeg`** and **`libaio-dev`** + - If **`/usr/local/cuda`** is missing, adds the NVIDIA CUDA apt repository and installs **`cuda-toolkit-12-8`** + - Installs **`uv`** into your user account if needed, then runs **`uv sync`** and **`uv pip install -e .`** into the project **`.venv`** + - On **aarch64** only: installs FFmpeg **development** packages and **builds `torchcodec` from source** into `.venv` + + ```bash I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 bash scripts/deployment/dgpu/install_deps.sh ``` - The install script: - - Installs system dependencies (`ffmpeg`, `libaio-dev`) - - Installs `uv` if not present - - Runs `uv sync` to create a `.venv` with all Python dependencies (PyTorch 2.7.1 + CUDA 12.8) - - Builds `torchcodec` from source on aarch64 (required for video decoding) + ### Option B — User-space only (no `install_deps.sh`) + + Use this only when **CUDA 12.8+** is already installed, system **`ffmpeg`** / 
**`libaio-dev`** are already present, and your policy forbids the script's `apt` or CUDA steps. From the **Isaac-GR00T** repo root, install **`uv`** if needed, then: + + ```bash + command -v uv >/dev/null || curl -LsSf https://astral.sh/uv/install.sh | sh + export PATH="/usr/local/cuda/bin:$HOME/.local/bin:$PATH" + export CUDA_HOME=/usr/local/cuda + uv sync + uv pip install -e . + ``` + + You still need a working **video backend** for LIBERO (see Step 2). On aarch64, building **torchcodec** inside `.venv` without the script is possible but manual; see Troubleshooting. + + > [!IMPORTANT] + > **`PATH` and `CUDA_HOME` matter on multi-toolkit hosts.** If the system has both an old Ubuntu `nvidia-cuda-toolkit` package (`/usr/bin/nvcc` ≈ 12.0) and a current NVIDIA CUDA repo install (`/usr/local/cuda-13.x/bin/nvcc`), `uv` will pick whichever appears first on `PATH`. Putting `/usr/local/cuda/bin` first (and exporting `CUDA_HOME`) is required for `flash-attn`'s source build to find the matching toolkit. Verify with `nvcc --version` after the export. + + > [!WARNING] + > **`flash-attn` build on aarch64 takes ~2 hours from source.** The upstream `pyproject.toml` only lists pre-built `flash-attn==2.7.4.post1` wheels for **`x86_64`**; on aarch64 (Grace + GB300), `uv sync` falls back to compiling ~72 CUDA kernels from source. A faster route is to pin `flash-attn==2.8.1` and reuse the GitHub release's prebuilt aarch64 wheel: + > + > ```toml + > # In pyproject.toml under [project] dependencies: + > "flash-attn==2.8.1", + > + > # In [tool.uv.sources]: + > flash-attn = [ + > { url = "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.1/flash_attn-2.8.1+cu12torch2.10cxx11abiTRUE-cp312-cp312-linux_aarch64.whl", + > marker = "sys_platform == 'linux' and platform_machine == 'aarch64' and python_version == '3.12'" }, + > ] + > ``` + > + > With this pin, `uv sync` finishes in ~1 minute on aarch64 instead of ~2 hours. The wheel works against torch 2.10. 
Verified on GB300 + CUDA 13.1 in this playbook's validation run. Activate the virtual environment: @@ -111,15 +186,32 @@ spec: Verify GPU access: ```bash - CUDA_VISIBLE_DEVICES=1 python -c "import torch; print(torch.cuda.get_device_name(0))" + CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_name(0))" ``` Expected output: `NVIDIA GB300` > [!NOTE] - > Replace `CUDA_VISIBLE_DEVICES=1` with the index of your GB300 GPU throughout this playbook. Run `nvidia-smi --query-gpu=index,name --format=csv,noheader` to find it. + > Examples in this playbook use **`CUDA_VISIBLE_DEVICES=0`** because the GB300 is at index `0` on a single-GPU Station. On a multi-GPU Station (for example RTX PRO 6000 + GB300), the GB300 may be at a different index — run `nvidia-smi --query-gpu=index,name --format=csv,noheader`, find the GB300 row, and substitute that index everywhere `CUDA_VISIBLE_DEVICES=0` appears below. - # Step 2. Set up HuggingFace authentication + # Step 2. PyAV patch for LIBERO video (strongly recommended) + + On many stacks **`torchcodec`** fails to import or build, the resolver falls back to **`pyav`**, and stock **`n1.6-release`** can raise **`NotImplementedError`** from `get_frames_by_indices` for the **`pyav`** backend (fallback order is already `torchcodec` → `decord` → `pyav` → `ffmpeg`). Without this patch, training may **appear hung**: GPU idle, no traceback, while **ffmpeg** spawns per-frame decode work on the CPU. + + From the **Isaac-GR00T repo root** with **`n1.6-release`** checked out and **`.venv` activated**: + + ```bash + git apply /path/to/dgx-station-playbooks/nvidia/station-gr00t/assets/patches/001-pyav-get-frames-by-indices.patch + uv pip install av + ``` + + If you copied `nvidia/station-gr00t/assets/patches/` into the Isaac-GR00T root instead, use `git apply assets/patches/001-pyav-get-frames-by-indices.patch`. + + Details and re-apply rules: `nvidia/station-gr00t/assets/patches/README.md`. 
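To see which decoder a given `.venv` is likely to end up on before starting a long run, a small probe along the following lines can help. This is a minimal sketch, not the `resolve_backend` logic from Isaac GR00T: the function name `first_available_backend` is hypothetical, and `torchcodec`, `decord`, and `av` are assumed to be the import names of the respective packages. A module being locatable does not guarantee that importing it succeeds (for example when shared FFmpeg libraries are missing), so treat the result as a hint.

```python
import importlib.util
import shutil

# Fallback order quoted in this step: torchcodec -> decord -> pyav -> ffmpeg.
# Each entry maps a backend label to the Python module assumed to provide it;
# None means "look for a binary of that name on PATH instead".
BACKEND_MODULES = [
    ("torchcodec", "torchcodec"),
    ("decord", "decord"),
    ("pyav", "av"),  # PyAV is imported as `av`
    ("ffmpeg", None),
]


def module_present(name):
    """Return True if `name` can be located without importing it."""
    return importlib.util.find_spec(name) is not None


def first_available_backend(has_module=module_present, which=shutil.which):
    """Hypothetical helper: first backend in the fallback order that looks usable."""
    for label, module in BACKEND_MODULES:
        if module is None:
            if which(label) is not None:
                return label
        elif has_module(module):
            return label
    return None


if __name__ == "__main__":
    print("Likely video backend:", first_available_backend())
```

Run it with the `.venv` activated. If it reports `ffmpeg` or `None`, installing `av` and applying the patch above is the path this step recommends.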
+ + After patching, repeated log lines such as `Video backend 'torchcodec' is not available, falling back to 'pyav'` are **expected** and noisy but not fatal. + + # Step 3. Set up HuggingFace authentication ```bash export HF_TOKEN="your_huggingface_token" @@ -127,7 +219,7 @@ spec: Get a token from https://huggingface.co/settings/tokens if you don't have one. - # Step 3. Download the dataset and model + # Step 4. Download the dataset and model Download the LIBERO Spatial dataset and the GR00T N1.6 base model: @@ -145,18 +237,33 @@ spec: huggingface-cli download nvidia/GR00T-N1.6-3B ``` + > [!NOTE] + > **HF cache permission errors:** If `huggingface-cli download` fails with `Permission denied: '/home/.../.cache/huggingface/hub/...'`, the cache directory was previously created by a Docker container running as root (common on shared dev boxes). Point HF at a user-owned cache for this run: + > + > ```bash + > export HF_HOME=$HOME/hf_cache_gr00t + > ``` + > + > **Transient `xet-read-token` 500 errors:** Hugging Face's xet backend occasionally returns `500 Internal Server Error` for dataset downloads. Disable it: + > + > ```bash + > export HF_HUB_DISABLE_XET=1 + > ``` + Verify the dataset is ready: ```bash ls examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/modality.json ``` - # Step 4. Verify the base model loads and runs + **Expected result:** the command prints the full path to **`modality.json`** (and `ls` exits 0). That confirms the merged modality file exists next to the downloaded LeRobot dataset metadata. - Confirm the GR00T N1.6 base model loads correctly and can produce actions. The base model ships with a GR1 demo dataset for quick verification: + # Step 5. 
Verify the base model loads and runs + + Confirm the GR00T N1.6 base model loads and produces actions using the **GR1** demo shipped on **`n1.6-release`**: ```bash - CUDA_VISIBLE_DEVICES=1 python scripts/deployment/standalone_inference_script.py \ + TORCHDYNAMO_DISABLE=1 CUDA_VISIBLE_DEVICES=0 python scripts/deployment/standalone_inference_script.py \ --model-path nvidia/GR00T-N1.6-3B \ --dataset-path demo_data/gr1.PickNPlace \ --embodiment-tag GR1 \ @@ -166,20 +273,19 @@ spec: --steps 32 ``` - You should see per-step timing output and no errors. This confirms the model, CUDA, and data pipeline all work before committing to a longer fine-tuning run. + **`TORCHDYNAMO_DISABLE=1`** avoids **`torch.compile`** / Triton paths that can fail on GB300 with **`ptxas-blackwell fatal: Value 'sm_103a' is not defined for option 'gpu-name'`**. Keep it on all **`standalone_inference_script.py`** invocations in this playbook unless you have a Triton build that supports SM103. + + You should see per-step timing output and no errors. This confirms the model, CUDA, and data pipeline work before a long fine-tuning run. > [!NOTE] - > If you see Triton/PTXAS errors about `sm_103a`, prepend `TORCHDYNAMO_DISABLE=1` to the command. See Troubleshooting for details. + > The base model's pretrained processor does not include the **`LIBERO_PANDA`** embodiment configuration, so you cannot run this standalone script on the LIBERO dataset with the **base** checkpoint alone. The LIBERO modality config is registered during fine-tuning. That is expected — LIBERO is a post-training benchmark. - > [!NOTE] - > The base model's pretrained processor does not include the `LIBERO_PANDA` embodiment configuration, so you cannot run evaluation directly on the LIBERO dataset with the base model. The LIBERO modality config is registered during fine-tuning. This is expected — LIBERO is a post-training benchmark. + # Step 6. Fine-tune GR00T N1.6 on LIBERO Spatial - # Step 5. 
Fine-tune GR00T N1.6 on LIBERO Spatial - - Fine-tune the base model on the LIBERO Spatial dataset. DGX Station's GB300 GPU with 284 GB HBM3e lets you use a batch size of **128** — roughly 4x what fits on a typical 80 GB GPU. Larger batch sizes mean more stable gradients and faster convergence per wall-clock hour. + Fine-tune the base model on LIBERO Spatial. DGX Station's GB300 GPU with 284 GB HBM3e allows a global batch size of **128** — several times what fits on a typical 80 GB GPU. Larger batches stabilize gradients and improve wall-clock throughput **when the dataloader keeps the GPU fed**. ```bash - CUDA_VISIBLE_DEVICES=1 python \ + CUDA_VISIBLE_DEVICES=0 python \ gr00t/experiment/launch_finetune.py \ --base-model-path nvidia/GR00T-N1.6-3B \ --dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \ @@ -198,29 +304,34 @@ spec: --dataloader-num-workers 4 ``` - Training runs for **2000 steps** at batch size 128 and takes approximately 20–25 minutes on the GB300. + If GPU utilization stays **near zero** for many minutes while the process is alive, suspect **video decoding** (see Step 2 patch and Troubleshooting). You can try **`--dataloader-num-workers 8`** if CPU cores are available. + + Training runs for **2000 steps** at batch size 128 and takes approximately **20–25 minutes** on GB300 when **`torchcodec`** is the active video backend. + + > [!IMPORTANT] + > **With the PyAV fallback (Step 2 patch + no torchcodec)**, expect ~5–6 s per step instead of <1 s — so 2000 steps is closer to **~3 hours** (2000 × 5–6 s ≈ 2.8–3.3 h), and GPU utilization sits in the 3–30 % range while CPU-side video decoding starves the GPU. To validate the workflow without the long wait, lower `--max-steps` (e.g. `100`) and `--save-steps` (e.g. `50`); loss should still drop visibly (validated drop **1.07 → 0.63** in 100 steps in this playbook's GB300 run).
If you need full-throughput training, build `torchcodec` from source (Troubleshooting → "Video decoding errors") or run **Option A** which builds it for you. > [!NOTE] - > This playbook uses 2000 steps to keep execution time under an hour. For production-quality results matching the published 97.65% success rate on LIBERO Spatial, increase to **20,000 steps** by changing `--max-steps 20000`. Published results used batch size 640 across 8 GPUs (80 per GPU) — batch size 128 on a single GB300 exceeds the per-GPU batch size used in the published benchmarks. + > This playbook uses 2000 steps to keep execution time under an hour. For production-quality results closer to the published **97.65%** success rate on LIBERO Spatial, increase to **20,000 steps** (`--max-steps 20000`). Published settings used batch size **640** across **8** GPUs — 128 on one GB300 exceeds the per-GPU batch in that reference. **What the training flags mean:** | Flag | Value | Purpose | |------|-------|---------| - | `--global-batch-size` | 128 | Total samples per training step. GB300's 284 GB HBM3e makes this possible on a single GPU. | - | `--state-dropout-prob` | 0.8 | Drops state input 80% of the time during training, forcing the model to rely on vision. Improves generalization. | - | `--color-jitter-params` | brightness/contrast/saturation/hue | Randomly perturbs image colors during training for robustness to lighting variation. | - | `--warmup-ratio` | 0.05 | Linearly ramps learning rate from 0 to 1e-4 over the first 5% of steps (100 steps). | - | `--save-steps` | 500 | Saves a checkpoint every 500 steps. | + | `--global-batch-size` | 128 | Total samples per training step; enabled by GB300 memory. | + | `--state-dropout-prob` | 0.8 | Drops proprioceptive state 80% of the time so the model relies on vision. | + | `--color-jitter-params` | brightness/contrast/saturation/hue | Photometric augmentation for lighting robustness. 
| + | `--warmup-ratio` | 0.05 | Linear LR warmup over the first 5% of steps. | + | `--save-steps` | 500 | Checkpoint cadence under `output/libero_spatial_ft/`. | - Monitor the training loss in the terminal. The HuggingFace Trainer logs progress at each step — look for the `loss` field decreasing over time. Checkpoints are saved every 500 steps to `output/libero_spatial_ft/`. + Monitor the Hugging Face **Trainer** `loss` in the terminal. Checkpoints land under `output/libero_spatial_ft/`. - # Step 6. Evaluate the fine-tuned model + # Step 7. Evaluate the fine-tuned model - Run open-loop evaluation on the fine-tuned checkpoint. This compares the model's predicted actions against the ground truth from the dataset: + Open-loop evaluation compares predicted actions to dataset ground truth and writes plots to **`/tmp/open_loop_eval/`**: ```bash - CUDA_VISIBLE_DEVICES=1 python gr00t/eval/open_loop_eval.py \ + CUDA_VISIBLE_DEVICES=0 python gr00t/eval/open_loop_eval.py \ --dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \ --embodiment-tag LIBERO_PANDA \ --model-path output/libero_spatial_ft/checkpoint-2000/ \ @@ -228,27 +339,17 @@ spec: --action-horizon 16 ``` - The evaluation outputs: - - - **Per-trajectory MSE and MAE** printed to the terminal - - **Average MSE** across all evaluated trajectories - - **JPEG visualizations** saved to `/tmp/open_loop_eval/` showing ground truth vs. predicted actions for each action dimension (x, y, z, roll, pitch, yaw, gripper) - - Key things to look for in the plots: - - - **Predicted trajectories** (orange line) should closely track the **ground truth** (blue line) - - **Gripper timing** — opening and closing at the correct moments - - **Lower MSE** indicates better action prediction accuracy + **How to read the run:** the terminal prints **per-trajectory MSE/MAE** and **averages**. 
The JPEGs under **`/tmp/open_loop_eval/`** overlay **predicted** vs **ground-truth** trajectories per action dimension (translation, rotation, gripper). Use them to confirm the policy tracks pick-and-place phases and gripper open/close timing on spatial tasks. > [!TIP] - > Even at 2000 steps, the fine-tuned model should show clearly improved action prediction compared to random. With 20,000 steps, LIBERO Spatial achieves 97.65% success rate in closed-loop simulation. + > At 2000 steps you should see clear improvement over a random policy; at 20,000 steps, published LIBERO Spatial success reaches **97.65%** in closed-loop sim. - # Step 7. Run inference timing benchmark + # Step 8. Run inference on a LIBERO sample (timing + actions) - Measure the fine-tuned model's per-step inference latency: + This step passes **LIBERO Spatial** observations through the **fine-tuned** checkpoint (the base model cannot run this embodiment). **`TORCHDYNAMO_DISABLE=1`** is included for GB300: ```bash - CUDA_VISIBLE_DEVICES=1 python scripts/deployment/standalone_inference_script.py \ + TORCHDYNAMO_DISABLE=1 CUDA_VISIBLE_DEVICES=0 python scripts/deployment/standalone_inference_script.py \ --model-path output/libero_spatial_ft/checkpoint-2000/ \ --dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \ --embodiment-tag LIBERO_PANDA \ @@ -257,21 +358,9 @@ spec: --action-horizon 8 ``` - > [!NOTE] - > If you see Triton/PTXAS errors about `sm_103a`, prepend `TORCHDYNAMO_DISABLE=1` to the command. This runs inference in eager mode. See Troubleshooting for details. + **What to inspect:** the script prints a **timing breakdown** (data processing, backbone, action head, end-to-end). Compare **MSE/MAE** and latency to Step 5's base-model smoke test. 
In eager mode (with `TORCHDYNAMO_DISABLE=1`), per-step latency on GB300 depends heavily on the torch + CUDA stack — expect **~3–4 s/step** on torch 2.10 + cu130 (validated in this playbook's run on a fine-tuned `checkpoint-100`); a compiled torch 2.7 + cu128 stack with Triton support for `sm_103` can be much faster. Treat the "Backbone vs Action head" split as the more stable signal across stacks. - The timing output breaks down into: - - - **Data processing** — loading and preprocessing the observation - - **Backbone** — vision-language model forward pass - - **Action head** — diffusion transformer denoising (4 steps) - - **End-to-end** — total inference time per action chunk - - In eager mode (without `torch.compile`), expect ~240 ms per step. With `torch.compile` working, expect ~38 ms per step comparable to H100. - - # Step 8. Clean up - - To remove the environment: + # Step 9. Clean up ```bash deactivate @@ -279,15 +368,15 @@ spec: rm -rf Isaac-GR00T ``` - Your fine-tuned checkpoints in `output/libero_spatial_ft/` are deleted with the repo. Copy them elsewhere first if you want to keep them. + Fine-tuned checkpoints under `output/libero_spatial_ft/` are removed with the repo. Copy them elsewhere first if you want to keep them. # Next steps - - **Increase training steps** — Change `--max-steps` to 20000 for results closer to the published 97.65% success rate on LIBERO Spatial. Training time scales linearly (~3.5 hours at 20K steps). - - **Try other LIBERO suites** — Download `libero_10_no_noops`, `libero_goal_no_noops`, or `libero_object_no_noops` datasets from the `IPEC-COMMUNITY` HuggingFace organization and repeat the workflow. Published success rates: Object 98.45%, Goal 97.5%, 10-Long 94.35%. - - **Closed-loop simulation evaluation** — Set up the LIBERO simulation environment to test the fine-tuned model in a live control loop with the Panda robot arm.
See the [LIBERO evaluation guide](https://github.com/NVIDIA/Isaac-GR00T/blob/main/examples/LIBERO/README.md#evaluate-checkpoint) for server-client setup. - - **Custom embodiments** — Fine-tune GR00T on your own robot data by following the [custom embodiment guide](https://github.com/NVIDIA/Isaac-GR00T/blob/main/getting_started/finetune_new_embodiment.md). Requires converting your data to LeRobot v2 format and defining a modality config. - - **Experiment with batch size** — The GB300's 284 GB HBM3e may support even larger batch sizes depending on which model components are being tuned. The default configuration tunes the projector and diffusion model only. Enabling `--tune-llm` or `--tune-visual` increases memory usage significantly. + - **Increase training steps** — `--max-steps 20000` for stronger LIBERO Spatial alignment (~3.5 hours at the same throughput). + - **Other LIBERO suites** — `libero_10_no_noops`, `libero_goal_no_noops`, `libero_object_no_noops` from **IPEC-COMMUNITY** on Hugging Face. + - **Closed-loop sim** — LIBERO sim server/client: [LIBERO evaluation in Isaac GR00T](https://github.com/NVIDIA/Isaac-GR00T/blob/n1.6-release/examples/LIBERO/README.md#evaluate-checkpoint). + - **Custom embodiments** — [Fine-tune a new embodiment](https://github.com/NVIDIA/Isaac-GR00T/blob/n1.6-release/getting_started/finetune_new_embodiment.md) (LeRobot v2 + modality JSON). + - **Tune more of the stack** — `--tune-llm` / `--tune-visual` raise memory use; probe batch size if you enable them. @@ -298,6 +387,68 @@ spec: content: | # Common Issues + ## Issue: `git clone` fails or demo videos are tiny / missing (Git LFS) + + **Solution:** + + ```bash + sudo apt-get install -y git-lfs + git lfs install + ``` + + Remove any partial `Isaac-GR00T` directory, then clone again with `--recurse-submodules`. 
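The "tiny" demo videos are usually Git LFS pointer stubs left behind when LFS was not installed at clone time. Below is a minimal sketch for spotting them, assuming the standard pointer format (a small text file whose first line is `version https://git-lfs.github.com/spec/v1`). The helper names and the 1 KiB size cutoff are illustrative choices, not part of Isaac GR00T.

```python
from pathlib import Path

LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"
MAX_POINTER_BYTES = 1024  # illustrative cutoff; real pointers are far smaller


def is_lfs_pointer(path):
    """True if `path` looks like an unfetched Git LFS pointer file."""
    p = Path(path)
    if not p.is_file() or p.stat().st_size > MAX_POINTER_BYTES:
        return False
    with p.open("rb") as f:
        return f.read(len(LFS_HEADER)) == LFS_HEADER


def find_lfs_pointers(root):
    """Return sorted paths under `root` that are still LFS pointers."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(str(p) for p in root.rglob("*") if is_lfs_pointer(p))


if __name__ == "__main__":
    # `demo_data` is the directory used by the base-model verification step.
    for stub in find_lfs_pointers("demo_data"):
        print("LFS pointer (not fetched):", stub)
```

If this prints any paths from the repo root, running `git lfs pull` after `git lfs install` normally fetches the real files; re-cloning as described above also works.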
+ + ## Issue: `GR1`, `demo_data/gr1.PickNPlace`, or scripts do not match the playbook + + **Cause:** The repository default branch (**`main`**) may track a newer GR00T line (for example N1.7) with different embodiment tags and demo layouts. + + **Solution:** + + ```bash + cd Isaac-GR00T + git fetch origin + git checkout n1.6-release + git submodule update --init --recursive + ``` + + Always run playbook commands from **`n1.6-release`** for **N1.6** + **GR00T-N1.6-3B**. + + ## Issue: `install_deps.sh` is not allowed on your machine (policy) or you need to know what it changes + + **Facts:** `scripts/deployment/dgpu/install_deps.sh` runs **`sudo apt-get`** to install **`ffmpeg`**, **`libaio-dev`**, and (on aarch64) FFmpeg **development** libraries for the **torchcodec** build. If **`/usr/local/cuda`** does not exist, it adds the NVIDIA CUDA apt repo and installs **`cuda-toolkit-12-8`**. It also installs **`uv`** into the user account if missing, then **`uv sync`** + **`uv pip install -e .`** into **`.venv`**. + + **Solution (policy-friendly):** Pre-install the same system packages and CUDA using your IT process, ensure **`nvcc`** works, then from the repo root: + + ```bash + export PATH="$HOME/.local/bin:$PATH" + uv sync + uv pip install -e . + ``` + + On **aarch64**, you still need **`torchcodec`** in `.venv` or rely on the **PyAV patch** (Instructions Step 2) plus **`uv pip install av`**. + + ## Issue: `uv sync` (Option B) appears stuck for hours building `flash-attn` on aarch64 + + **Cause:** Upstream `pyproject.toml` lists pre-built `flash-attn==2.7.4.post1` wheels only for `linux_x86_64`. On **aarch64** (Grace + GB300), `uv` falls back to a from-source build that compiles ~72 CUDA kernels — typically **~2 hours** end-to-end. + + **Solution:** Pin to `flash-attn==2.8.1` and use the GitHub release's prebuilt aarch64 wheel. 
Edit `pyproject.toml` in the repo root: + + ```toml + # under [project] dependencies, replace: + # "flash-attn==2.7.4.post1", + "flash-attn==2.8.1", + + # under [tool.uv.sources], add: + flash-attn = [ + { url = "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.1/flash_attn-2.8.1+cu12torch2.10cxx11abiTRUE-cp312-cp312-linux_aarch64.whl", + marker = "sys_platform == 'linux' and platform_machine == 'aarch64' and python_version == '3.12'" }, + ] + ``` + + The `cu12torch2.10` aarch64 wheel works against torch 2.10 (cu128 or cu130 builds). Validated on GB300 + CUDA 13.1 — `uv sync` completes in ~1 minute instead of ~2 hours. Track upstream Isaac-GR00T for a future commit that bakes this in. + + If you must keep `flash-attn==2.7.4.post1` (Option A path), expect the 2-hour build on first sync; subsequent `uv sync` invocations re-use the cached wheel. + ## Issue: `install_deps.sh` fails building torchcodec **Solution:** @@ -308,19 +459,20 @@ spec: I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 bash scripts/deployment/dgpu/install_deps.sh ``` - If the build still fails, ensure FFmpeg dev libraries are installed: + If the build still fails, install FFmpeg development libraries: ```bash sudo apt-get install -y libavdevice-dev libavfilter-dev libavformat-dev \ - libavcodec-dev libavutil-dev libswresample-dev libswscale-dev + libavcodec-dev libavutil-dev libswresample-dev libswscale-dev \ + pkg-config cmake build-essential pybind11-dev ``` + Then apply **Instructions Step 2** (PyAV patch) so training does not depend on a working **torchcodec** for indexed frame reads. + ## Issue: `huggingface-cli download` fails with 401 Unauthorized **Solution:** - Verify your HuggingFace token is set and valid: - ```bash echo $HF_TOKEN huggingface-cli whoami @@ -332,119 +484,153 @@ spec: export HF_TOKEN="your_token_here" ``` - Make sure you have accepted any required model agreements on the HuggingFace model page. 
+ Accept any required license or gated-model agreements on the Hugging Face model page. + + ## Issue: `huggingface-cli download` fails with `Permission denied: '/home/.../.cache/huggingface/hub/...'` + + **Cause:** The shared cache directory was previously created by a Docker container running as **root** (common on multi-user dev boxes that mount `~/.cache/huggingface` into containers without `--user`). The current user (`nvidia`) cannot write into it. + + **Solution:** point HF at a user-owned cache location for this run: + + ```bash + export HF_HOME=$HOME/hf_cache_gr00t + mkdir -p "$HF_HOME" + huggingface-cli download nvidia/GR00T-N1.6-3B + ``` + + Re-export `HF_HOME` for the rest of the playbook (Step 5 onward) so model loads find the right cache. To permanently un-stick the original cache, ask whoever owns the container session to chown `~/.cache/huggingface` back to your user. + + ## Issue: `huggingface-cli download` returns `500 Internal Server Error` from the `xet-read-token` endpoint + + **Cause:** Hugging Face's xet content-addressable backend occasionally returns transient `5xx`. This blocks dataset downloads even though the underlying files are reachable via the legacy backend. + + **Solution:** disable xet for the download: + + ```bash + export HF_HUB_DISABLE_XET=1 + huggingface-cli download --repo-type dataset \ + IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot \ + --local-dir examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ + ``` + + ## Issue: `externally-managed-environment` or `pip` installs not going into `.venv` + + **Cause:** Debian/Ubuntu **PEP 668** blocks `pip install` onto the system Python. Mixing **`sudo pip`** with the project venv breaks the playbook. + + **Solution:** + + 1. **`source .venv/bin/activate`** — prompt should show `(.venv)`. + 2. Use **`uv pip install ...`** (or **`python -m pip install ...`**) **only** with the venv activated — never `sudo pip` for this project. + 3. 
If the venv was created with a broken `pip`, recreate: `rm -rf .venv` and run **`uv sync`** again from the repo root (after `n1.6-release` checkout). ## Issue: CUDA out of memory during fine-tuning **Solution:** - If fine-tuning fails with an OOM error at batch size 128, reduce the batch size: + Reduce batch size: ```bash --global-batch-size 64 ``` - Also check that no other processes are using GPU memory: + Check for other GPU processes: `nvidia-smi`. **`--tune-llm`** / **`--tune-visual`** increase memory use substantially. - ```bash - nvidia-smi - ``` + ## Issue: Triton / PTXAS errors about `sm_103a` (GB300 / Blackwell) - If you are tuning additional model components (`--tune-llm` or `--tune-visual`), these significantly increase memory usage. The default configuration (projector + diffusion model only) is the most memory-efficient. + **Symptom:** - ## Issue: Triton/PTXAS errors about `sm_103a` during inference - - **Solution:** - - The bundled Triton version may not yet support SM103 (GB300). This causes errors like: - - ``` + ```text ptxas-blackwell fatal: Value 'sm_103a' is not defined for option 'gpu-name' ``` - Disable `torch.compile` by prepending: + **Solution:** + + For **`scripts/deployment/standalone_inference_script.py`** (which may use **`torch.compile`**), prepend: ```bash TORCHDYNAMO_DISABLE=1 python scripts/deployment/standalone_inference_script.py ... ``` - This runs inference in eager mode (~240 ms/step instead of ~38 ms/step with compile). Training and open-loop evaluation are not affected since they use eager mode by default. + This forces eager inference (higher latency per step but stable on SM103 until Triton catches up). Fine-tuning and **`open_loop_eval.py`** typically run without this compile path; use the same prefix there **only** if you see the same crash. ## Issue: `ModuleNotFoundError: No module named 'gr00t'` **Solution:** - The virtual environment is not activated. 
Run: - ```bash source .venv/bin/activate + pwd # .../Isaac-GR00T ``` - Verify you are in the Isaac-GR00T directory: + ## Issue: `NotImplementedError` in `get_frames_by_indices` when backend is `pyav` + + **Cause:** On **`n1.6-release`**, **`resolve_backend`** can select **`pyav`**, but stock **`get_frames_by_indices`** did not implement the **`pyav`** branch. + + **Solution:** Apply the playbook patch and install PyAV (see **Instructions Step 2** and `assets/patches/README.md`). + + ## Issue: Training “hangs” — low GPU utilization, no traceback, very slow steps + + **Cause:** Fallback to **per-frame `ffmpeg` subprocess** decoding for **AV1** LIBERO clips; dataloaders starve the GPU. + + **Solution:** + + 1. Apply the **PyAV patch** (Step 2) and **`uv pip install av`**. + 2. Optionally increase **`--dataloader-num-workers`** (for example **8**) if CPUs are free. + + **Expected noise after patching:** logs may repeat `Video backend 'torchcodec' is not available, falling back to 'pyav'` — that is normal if **torchcodec** is absent. + + ## Issue: Video decoding errors / `torchcodec` not found (general) + + **Solution:** + + Prefer the **PyAV patch + `av`** path above for LIBERO on GB300. + + If you must build **torchcodec** into `.venv` manually (aarch64), with FFmpeg dev packages installed: ```bash - pwd - # Should show: .../Isaac-GR00T + # Run this from inside the Isaac-GR00T repo root (the directory that + # contains .venv). Capture its absolute path BEFORE changing directories + # so we can still reach the virtualenv after cd'ing into /tmp/torchcodec. + GR00T_ROOT="$(pwd)" + + # Sanity check — the virtualenv interpreter must already exist. + test -x "$GR00T_ROOT/.venv/bin/python" || { echo "Not in Isaac-GR00T root (missing .venv/bin/python)"; } + + # Clone the torchcodec source into /tmp/torchcodec (skip if already cloned). 
+ git clone https://github.com/pytorch/torchcodec.git /tmp/torchcodec + cd /tmp/torchcodec + + # Build torchcodec into the Isaac-GR00T virtualenv using the absolute + # path captured above (do NOT use the relative ".venv/bin/python" here — + # the current directory is /tmp/torchcodec, which has no .venv). + I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 ENABLE_CUDA=1 \ + uv pip install --python "$GR00T_ROOT/.venv/bin/python" . --no-build-isolation ``` + CUDA-enabled builds can fail when system FFmpeg or CUDA does not match torchcodec expectations — in that case use the **PyAV patch** instead. + ## Issue: Training loss is not decreasing **Solution:** - At 2000 steps, the model may not have converged fully — this is expected for the shortened playbook run. If loss remains flat after 500+ steps: + At 2000 steps the model may still be early. If loss is flat after many steps: - 1. Verify the dataset was downloaded correctly and the modality config was copied: - ```bash - ls examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/modality.json - ``` - - 2. Check that the correct embodiment tag is used (`LIBERO_PANDA`, not `NEW_EMBODIMENT`). - - 3. Try increasing the learning rate to `5e-4` for faster initial convergence on short runs. + 1. Verify modality file: `ls examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/modality.json` + 2. Confirm **`--embodiment-tag LIBERO_PANDA`** + 3. Try **`--learning-rate 5e-4`** for faster early movement on short runs ## Issue: `nvidia-smi` shows the wrong GPU **Solution:** - On DGX Station, the GB300 may not be device 0. Find the correct index: - ```bash nvidia-smi --query-gpu=index,name --format=csv,noheader + CUDA_VISIBLE_DEVICES= python ... ``` - Use the GB300's index with `CUDA_VISIBLE_DEVICES`: + ## Issue: OpenCV or decord cannot decode LIBERO AV1 - ```bash - CUDA_VISIBLE_DEVICES=1 python ... 
- ``` - - ## Issue: Slow data loading during training - - **Solution:** - - Increase the number of dataloader workers: - - ```bash - --dataloader-num-workers 8 - ``` - - ## Issue: Video decoding errors (`NotImplementedError` or torchcodec not found) - - **Solution:** - - The `install_deps.sh` script builds torchcodec from source on aarch64. If it wasn't built correctly, reinstall: - - ```bash - sudo apt-get install -y libavdevice-dev libavfilter-dev libavformat-dev \ - libavcodec-dev libavutil-dev libswresample-dev libswscale-dev \ - pkg-config cmake build-essential pybind11-dev - - git clone --depth 1 --branch release/0.4 https://github.com/meta-pytorch/torchcodec.git /tmp/torchcodec - cd /tmp/torchcodec - I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 ENABLE_CUDA=1 \ - uv pip install --python .venv/bin/python . --no-build-isolation - cd - && rm -rf /tmp/torchcodec - ``` + **Notes:** **OpenCV** often fails on **AV1** in LIBERO assets. **decord** may lack a compatible wheel for your platform. The **PyAV** patch path is the supported mitigation in this playbook. @@ -462,7 +648,7 @@ spec: url: https://research.nvidia.com/labs/gear/gr00t-n1_6/ - - name: GR00T N1.6 Paper + - name: GR00T N1 Paper url: https://arxiv.org/abs/2503.14734 diff --git a/nvidia/station-healthcare-agent/README.md b/nvidia/station-healthcare-agent/README.md index 9c92127..4dca67c 100644 --- a/nvidia/station-healthcare-agent/README.md +++ b/nvidia/station-healthcare-agent/README.md @@ -122,10 +122,8 @@ Supporting scripts (`setup_sandbox.sh`, `check_sandbox_config.sh`, `build_viewer * Large model downloads (~86 GB) may fail on slow or unstable connections * OpenFold3 NIM takes ~3 minutes to load — the healthcheck waits automatically * **Rollback:** `openshell sandbox delete clinical-sandbox`, `make down`, `make clean` (see Cleanup in Instructions). 
-* **Last Updated:** 05/06/2026 - * Refocused positioning on secure local agent workflows - * Added architecture section with agent descriptions and skill file examples - * Addressed VDR feedback: documented host-Ollama port conflict, NGC docker login, multi-GPU pinning, Node.js v22 upgrade, and other infrastructure setup gaps +* **Last Updated:** 05/12/2026 + * First Publication ### Notice and disclaimers diff --git a/nvidia/station-healthcare-agent/endpoint-test.yaml b/nvidia/station-healthcare-agent/endpoint-test.yaml index a508fa5..3ddcd27 100644 --- a/nvidia/station-healthcare-agent/endpoint-test.yaml +++ b/nvidia/station-healthcare-agent/endpoint-test.yaml @@ -152,10 +152,8 @@ spec: * Large model downloads (~86 GB) may fail on slow or unstable connections * OpenFold3 NIM takes ~3 minutes to load — the healthcheck waits automatically * **Rollback:** `openshell sandbox delete clinical-sandbox`, `make down`, `make clean` (see Cleanup in Instructions). - * **Last Updated:** 05/06/2026 - * Refocused positioning on secure local agent workflows - * Added architecture section with agent descriptions and skill file examples - * Addressed VDR feedback: documented host-Ollama port conflict, NGC docker login, multi-GPU pinning, Node.js v22 upgrade, and other infrastructure setup gaps + * **Last Updated:** 05/12/2026 + * First Publication ## Notice and disclaimers diff --git a/nvidia/station-kernel-dev-ft/README.md b/nvidia/station-kernel-dev-ft/README.md index d135667..2f7463a 100644 --- a/nvidia/station-kernel-dev-ft/README.md +++ b/nvidia/station-kernel-dev-ft/README.md @@ -49,6 +49,7 @@ You will profile a LLaMA 3.1 8B fine-tuning workload, identify the key performan **Software:** - Docker with NVIDIA Container Toolkit: `docker run --rm --gpus all nvcr.io/nvidia/cuda:12.8.0-devel-ubuntu24.04 nvidia-smi` +- On a DGX Station, immediately confirm which device index belongs to the GB300 so later steps can target it explicitly. 
Run `nvidia-smi --query-gpu=index,name --format=csv,noheader` and note the index for the row showing `NVIDIA GB300`. Subsequent steps recommend `--gpus '"device=N"'` (with `N` = that index) instead of `--gpus all` so profiling and benchmark numbers stay on a single, known GPU. - Network access to pull container images from NGC and download model weights from Hugging Face. - A Hugging Face account with access to [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) and a [Hugging Face access token](https://huggingface.co/settings/tokens). @@ -69,14 +70,14 @@ All required assets are in the playbook directory `nvidia/station-kernel-dev-ft/ ## Time & risk -* **Estimated time:** About 2 hours. Steps 1-4 (setup through baseline profiling) take about 30 minutes. Steps 5-7 (RMSNorm kernel) take about 30 minutes. Steps 8-10 (cross-entropy kernel) take about 40 minutes. Step 11 (end-to-end integration) takes about 20 minutes. +* **Estimated time:** About 2 hours. Steps 1-4 (setup through baseline profiling) take about 30 minutes. Steps 5-7 (RMSNorm kernel) take about 30 minutes. Steps 8-10 (cross-entropy kernel) take about 40 minutes. Step 11 (end-to-end integration) takes about 20 minutes. Steps 12-13 (cleanup and next steps) are a few minutes. * **Risk level:** Low * All work runs inside a Docker container — no host system modifications. * LLaMA 3.1 8B model weights (~16 GB in BF16) are downloaded from Hugging Face on first run and cached locally. * Requires a Hugging Face token with access to the LLaMA 3.1 model. * **Rollback:** Exit the container. Your source files are preserved in the mounted `assets/` directory; everything else is discarded. -* **Last Updated:** 03/30/2026 - * First Publication +* **Last Updated:** 05/26/2026 + * First publication ## Instructions @@ -97,12 +98,22 @@ Build the development container. This creates a Docker image based on NVIDIA's P docker build -t kernel-dev-ft . 
``` +Identify the GB300's device index so the container can target it explicitly. On multi-GPU DGX Station systems, pinning to a single, known GPU keeps profiling and benchmark numbers consistent across runs: + +```bash +nvidia-smi --query-gpu=index,name --format=csv,noheader +``` + +Look for the row showing `NVIDIA GB300` and note its index (commonly `0` or `1`). Use that value as `N` in the next command. + Start the container with GPU access. Pass your Hugging Face token so the container can download LLaMA 3.1 8B: ```bash +## Replace N with the GB300 index from the command above. +## On a single-GPU Station you may substitute --gpus all. docker run -it --rm \ --name kernel-dev-ft \ - --gpus all \ + --gpus '"device=N"' \ --ipc host \ -e HF_TOKEN=$HF_TOKEN \ -v "$(pwd):/workspace" \ @@ -114,6 +125,9 @@ docker run -it --rm \ > [!NOTE] > The `-v "$(pwd):/workspace"` flag mounts the current directory into the container. Any files you create or modify inside `/workspace` persist on your host machine after the container exits. The `-v ~/.cache/huggingface:/root/.cache/huggingface` mount persists downloaded model weights across container restarts so you don't need to re-download the 16 GB model each time. Everything outside these mounted paths is discarded when the container stops. +> [!IMPORTANT] +> Targeting the GB300 explicitly with `--gpus '"device=N"'` (rather than `--gpus all`) ensures `torch.cuda` and `nvidia-smi` inside the container both see the **GB300** as device `0`. Profiling and benchmark numbers later in this playbook assume a single Blackwell GPU; mixing a workstation GPU in via `--gpus all` can change scheduling and skew tokens/sec and bandwidth utilization figures. + > [!NOTE] > If you haven't set `HF_TOKEN` in your shell, export it first: `export HF_TOKEN=hf_your_token_here`. You need a Hugging Face token with access to [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B). 
You must first accept the LLaMA 3.1 Community License Agreement on the [model page](https://huggingface.co/meta-llama/Llama-3.1-8B) before your token can download the weights. @@ -128,7 +142,7 @@ nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader Expected output should show: - Triton version 3.0 or later - PyTorch with CUDA support enabled -- Your Blackwell GPU with **compute capability 10.0** (this identifier is shared by all Blackwell GPUs — GB200, GB300, B200, B300) +- A Blackwell GPU with a `10.x` compute capability. The exact minor version depends on the SKU — for example, `nvidia-smi` reports **`10.0`** on GB200 / B200 (standard Blackwell) and **`10.3`** on GB300 / B300 (Blackwell Ultra; same GPU silicon, different host packaging). Any `10.x` value is fine for this playbook; the kernels target the Blackwell family, not a specific minor. > [!NOTE] > Unlike the CUDA C++ workflow (which requires the `nvcc` compiler and a separate compilation step), Triton is a Python library that JIT-compiles GPU code at runtime. There is no build step — you write Python, and Triton compiles it to optimized GPU machine code when you first call the kernel. @@ -166,6 +180,13 @@ Training is different for three reasons: 2. **Large vocabularies create massive intermediate tensors.** LLaMA's 128K vocabulary means the logit tensor for a single batch is enormous. Standard cross-entropy materializes this entire tensor in memory. 3. **Memory is the binding constraint.** Unlike inference (where latency matters most), training is often limited by how much data fits in GPU memory. Kernels that reduce memory enable larger batch sizes, which improve GPU utilization across *all* operations. +**Memory-bound vs compute-bound (where to spend effort):** + +- **Memory-bound** regions are limited by how fast you can move bytes through HBM (read/write bandwidth). Symptoms: small kernels, low achieved GB/s vs peak, profiler shows many narrow ops or fusion gaps. 
**Optimize** by fusing passes, reducing tensor materialization, using narrower dtypes where safe, and improving coalescing so each byte does more useful math. +- **Compute-bound** regions are limited by arithmetic throughput (Tensor Cores, FP32/FP16 units). Symptoms: large `aten::mm` / matmul and attention dominating self CUDA time with high utilization. **Optimize** with better tiling, larger batch sizes (more work per launch), kernels that keep math in registers, and libraries (cuBLAS, FlashAttention) before hand-writing alternatives. + +A single training step usually mixes both: matmuls tend toward **compute-bound** on large batches, while pointwise norm/loss paths are often **memory-bound**. Profiling tells you which bucket your hotspot falls into. + ## Step 3. Profile a baseline training step Now let's see where GPU time actually goes. We'll use `torch.profiler` to capture a detailed trace of a single forward + backward + optimizer step. @@ -179,6 +200,15 @@ python profile_baseline.py > [!NOTE] > The first run downloads LLaMA 3.1 8B weights (~16 GB in BF16) from Hugging Face. This takes several minutes depending on network speed. Subsequent runs use the cached weights and start immediately. +> [!NOTE] +> **Repeat runs:** `profile_baseline.py` removes any prior trace directory and Chrome JSON for the same flags before recording, so you can re-run baseline profiling without a "trace is already saved" error. + +> [!NOTE] +> **Ranking variance:** The exact ordering and percentages in the "Top 20 CUDA operations" table can change between runs, PyTorch / CUDA versions, and GPU generation. You should still see the same *categories* of work (matmuls, FlashAttention, RMSNorm decompositions, cross-entropy). **"Command Buffer Full"** (or similar) sometimes appears at the top of self-time tables: that reflects the GPU driver's **submission queue / scheduling**, not a user kernel to optimize. 
The script filters that row from the printed table; the raw trace in Perfetto still contains the underlying kernels. + +> [!TIP] +> **Optional Nsight Systems timeline:** For a visual timeline with CUDA API and GPU work (outside or alongside `torch.profiler`), install [NVIDIA Nsight Systems](https://developer.nvidia.com/nsight-systems) and run something like: `nsys profile -o llama_ft_repro --trace=cuda,nvtx python profile_baseline.py` from an environment where `nsys` is on `PATH` (often the host, or a devel image with CUDA toolkit). Open the `.nsys-rep` file in the Nsight Systems GUI. + The script loads LLaMA 3.1 8B, runs one training step under `torch.profiler`, and prints a table like this: ``` @@ -380,11 +410,22 @@ Tokens Custom (us) PyTorch (us) Custom (GB/s) PyTorch (GB/s) Spee 16,384 298.9 2,041.7 2,694 394 6.83x ``` -**How to read these results:** -- **Custom (GB/s)** shows effective memory bandwidth. On large inputs (16K tokens), the fused kernel reaches ~2,700 GB/s — significantly better than PyTorch's ~400 GB/s. -- **Speedup** ranges from ~1.5x on small inputs to ~6.8x on large inputs. The improvement grows with tensor size because PyTorch's unfused operations suffer more from memory round-trips as data grows, while the fused kernel's cost stays nearly flat until the data is large enough to saturate the GPU. +**How to read these results (what "better" means):** + +| Column / metric | Better when… | +|-----------------|---------------| +| **Custom (us)** | **Lower** is faster (fewer microseconds per forward+backward pass). | +| **PyTorch (us)** | Reference only; same rule (lower is faster). | +| **Custom (GB/s)** | **Higher** means you move closer to HBM peak (more useful bytes per second for this fused region). | +| **Speedup** | **Higher** means the custom kernel beats PyTorch by a larger factor on that row. | + +- **Custom (GB/s)** shows effective memory bandwidth. 
On large inputs (16K tokens), the fused kernel typically reaches much higher GB/s than the unfused PyTorch path. +- **Speedup** often ranges from roughly **1.5x** on small inputs to **6x+** on large inputs in internal runs. - These numbers measure **forward + backward combined**, which is what matters for training. +> [!NOTE] +> **Treat the table as illustrative, not a target.** Absolute microsecond and GB/s values **can differ by an order of magnitude** between GB300 stacks (different driver versions, NGC PyTorch builds, clock states, autograd overhead between iterations). On the validation run for this playbook the same shapes measured ~4,000–5,000 µs per fwd+bwd instead of ~300 µs, while still showing **custom faster than PyTorch and the gap widening with `num_tokens`**. The **direction of the speedup** (and the GB/s ratio between custom and PyTorch in the same run) is the stable signal — match those, and your kernel is healthy. + Now re-profile the full training step with the custom RMSNorm to confirm the bottleneck is resolved: ```bash @@ -460,7 +501,7 @@ Test 1: Float32 Test 2: BFloat16 (relaxed tolerance) BF16 Loss — ref: 12.250000 custom: 12.247243 diff: 2.76e-03 PASSED - BF16 Gradient — max diff: 2.98e-08 PASSED + BF16 Gradient — max diff (fp32 compare): 1.23e-01 PASSED Memory Comparison ------------------------------------------------------------ @@ -475,7 +516,10 @@ All cross-entropy tests PASSED The **memory comparison** shows that standard PyTorch cross-entropy allocates ~500 MB (for the softmax output and other intermediates), while the fused kernel uses ~250 MB. The 2x reduction measured here understates the real benefit: in the benchmark (Step 10), where memory is measured more precisely per-operation, the reduction is **~6x**. The larger benefit appears because the benchmark isolates just the cross-entropy overhead, while this test includes the base logit tensor allocation in both measurements. 
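The fused kernel saves memory because it never stores a full softmax output. As a minimal sketch of the chunked (online) log-sum-exp idea behind a fused cross-entropy, in plain Python: the helper names `online_logsumexp` and `cross_entropy` and the toy chunk size are assumptions for illustration, and the playbook's Triton kernel may differ in detail.

```python
import math


def online_logsumexp(logits, chunk_size=4):
    """Compute log(sum(exp(x))) chunk by chunk while tracking a running max.

    Hypothetical helper for illustration only; not the playbook's kernel.
    """
    running_max = float("-inf")
    running_sum = 0.0  # sum of exp(x - running_max) over the values seen so far
    for start in range(0, len(logits), chunk_size):
        chunk = logits[start:start + chunk_size]
        new_max = max(running_max, max(chunk))
        # Rescale the old partial sum to the new max before adding this chunk.
        running_sum *= math.exp(running_max - new_max)
        running_sum += sum(math.exp(x - new_max) for x in chunk)
        running_max = new_max
    return running_max + math.log(running_sum)


def cross_entropy(logits, target_index, chunk_size=4):
    """Loss for one row: logsumexp(logits) - logits[target_index]."""
    return online_logsumexp(logits, chunk_size) - logits[target_index]


# Logits this large would overflow a naive exp(); the running max avoids that.
row = [1000.0, 1001.0, 999.0, 2.0, -5.0]
print(cross_entropy(row, target_index=1))
```

Only one chunk of values plus two scalars is live at any time, which is why the approach scales to a 128K vocabulary without an extra vocabulary-sized buffer.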
> [!NOTE] -> Cross-entropy involves `log(sum(exp(...)))`, which is numerically sensitive. The online softmax algorithm maintains stability through the running-max trick — subtracting the maximum logit before exponentiating prevents overflow. The loss values should match PyTorch within 1e-5 in FP32 or 1e-2 in BF16. +> Cross-entropy involves `log(sum(exp(...)))`, which is numerically sensitive. The online softmax algorithm maintains stability through the running-max trick — subtracting the maximum logit before exponentiating prevents overflow. FP32 checks use tight tolerances. **BF16** compares loss with relaxed `atol/rtol` and compares **gradients in float32** with wider tolerances (`atol=2e-1`, `rtol=2e-1`) so chunked reductions over 128K vocabulary do not false-fail against PyTorch's different accumulation order. + +> [!WARNING] +> BF16 tolerances are intentionally looser than FP32: they assert the custom kernel matches the reference **within training-usable error**, not bitwise. Tighten tolerances only if you change the algorithm or dtype strategy. ## Step 10. Benchmark and re-profile cross-entropy @@ -502,10 +546,14 @@ Tokens Custom (us) PyTorch (us) Speedup Custom Mem (MB) PyTorch M 1,024 315 1,277 4.06x 251 1,506 6.0x ``` -**How to read these results:** -- **Speedup** grows from slower at 128 tokens (kernel launch overhead dominates) to **4x at 1,024 tokens**. The fused kernel has a higher fixed cost per row (looping over 128K vocabulary in chunks) but scales much better because it avoids the massive intermediate softmax allocation. -- **Memory reduction** (~6x): PyTorch allocates separate tensors for the logits, softmax output, and loss gradients. The fused kernel avoids the softmax intermediary. For 1,024 tokens, this saves over 1 GB of GPU memory — room for larger batches or longer sequences. -- At very small batch sizes (128 tokens), the fused kernel is **slower**. 
This is expected and normal — the overhead of the online softmax loop exceeds the cost of PyTorch's bulk computation at small scales. The crossover point is around 256 tokens. +**How to read these results:** For latency columns, **lower microseconds is better**. For **Speedup**, **higher is better** (custom faster than PyTorch). For **Mem Reduction**, **higher is better** (more peak memory saved). + +- **Speedup** grows from slower at 128 tokens (kernel launch overhead dominates) to several times faster at 1,024 tokens in typical runs. +- **Memory reduction** (~6x in the table): PyTorch allocates separate tensors for the logits, softmax output, and loss gradients. The fused kernel avoids the softmax intermediary. For 1,024 tokens, this saves over 1 GB of GPU memory, room for larger batches or longer sequences. +- At very small token counts (128), the fused kernel can be **slower**. That is expected: the online softmax loop has fixed per-row overhead. The crossover is often near 256–1,024 tokens depending on stack. + +> [!NOTE] +> Same caveat as in Step 7: **absolute microseconds in the example table are illustrative**. On the validation run for this playbook the per-iteration latencies were several thousand µs rather than the ~300 µs printed above, while the **memory reduction** (~6x) and the **speedup direction** (fused becomes faster as `num_tokens` grows) remained stable. Trust the **shape** of the table and the **memory column**, not the absolute latency numbers. 
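The memory column can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes BF16 logits (2 bytes per element) and a vocabulary of 128,256 entries for LLaMA 3.1; both values and the helper name `logits_mib` are assumptions for illustration, not values read from the benchmark script.

```python
def logits_mib(num_tokens, vocab_size=128_256, bytes_per_element=2):
    """Size in MiB of one [num_tokens, vocab_size] tensor (hypothetical helper)."""
    return num_tokens * vocab_size * bytes_per_element / (1024 ** 2)


one_copy = logits_mib(1024)
print(f"one logit-sized tensor at 1,024 tokens: {one_copy:.1f} MiB")

# Ratio of the PyTorch peak from the example table to one logit-sized tensor.
print(f"PyTorch peak / one tensor: {1506 / one_copy:.1f}x")
```

Under these assumptions a single logit-sized tensor is about 250 MiB, which is close to the custom kernel's figure for 1,024 tokens, while the PyTorch figure corresponds to roughly six tensors of that size. That is consistent with the ~6x reduction in the table, though it does not by itself show which intermediates PyTorch allocates.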
Now re-profile with both custom kernels active: @@ -531,24 +579,27 @@ Then run the optimized version with both custom kernels: python finetune_optimized.py ``` -Example comparison: +Example comparison on **GB300** (default `--batch-size 1`, `--seq-len 512`; throughput is `batch * seq_len / step_time`; numbers below are **illustrative**, not a target): ``` ====================================================================== Baseline Results ====================================================================== - Average time per step: 1.842 s - Average throughput: 278 tokens/sec + Average time per step: 0.201 s + Average throughput: 2540 tokens/sec (illustrative) Peak GPU memory: 112.4 GB ====================================================================== - Optimized Results + Optimized Results (illustrative) ====================================================================== - Average time per step: 1.614 s - Average throughput: 317 tokens/sec + Average time per step: 0.194 s + Average throughput: 2640 tokens/sec (illustrative) Peak GPU memory: 78.6 GB ``` +> [!NOTE] +> **Treat the throughput numbers above as illustrative, not a target** — same caveat as the RMSNorm (Step 7) and cross-entropy (Step 10) benchmark notes. Absolute tok/s and step time **vary** with GPU generation, clocks, PyTorch / CUDA builds, and whether the warm-up pass included JIT; older runs near **~280 tok/s** were observed on different stacks. The **stable signals** are (1) the **relative gap** — optimized > baseline — and (2) the **peak GPU memory delta** (the cross-entropy memory reduction is what frees room for larger batch sizes). Match those, not the absolute tok/s. 
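The throughput formula quoted above, `batch * seq_len / step_time`, can be applied directly to the example numbers. This is a small sketch with a hypothetical helper name; the inputs are the illustrative values from the comparison, not new measurements.

```python
def tokens_per_second(batch_size, seq_len, step_time_s):
    """Throughput as defined in the text: batch * seq_len / step_time."""
    return batch_size * seq_len / step_time_s


# Illustrative values from the example comparison (batch 1, sequence length 512).
baseline = tokens_per_second(1, 512, 0.201)
optimized = tokens_per_second(1, 512, 0.194)

relative_gain = optimized / baseline - 1.0
memory_saved_gb = 112.4 - 78.6

print(f"baseline:  {baseline:.0f} tok/s")
print(f"optimized: {optimized:.0f} tok/s")
print(f"relative throughput gain: {relative_gain:.1%}")
print(f"peak memory saved: {memory_saved_gb:.1f} GB")
```

With these inputs the throughput gain is only a few percent, while the peak-memory drop is about 30% of the baseline. Comparing the relative gain and the memory delta, rather than absolute tok/s, keeps results comparable across different software stacks.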
+ **How the custom kernels are integrated:** The `finetune_optimized.py` script uses the "surgical replacement" pattern to swap in custom kernels without modifying the model source code: @@ -577,7 +628,7 @@ This pattern — find modules by type, create optimized replacements, swap them > [!NOTE] > **Amdahl's law in action.** An 8x faster RMSNorm does not make training 8x faster. If RMSNorm was 10% of total step time, making it 8x faster saves about 8.75% of total time. The cross-entropy memory reduction has an outsized impact because it frees GPU memory that enables larger batch sizes, which improves GPU utilization across *all* operations — including the matrix multiplications and attention that dominate the compute profile. -## Step 12. Cleanup and next steps +## Step 12. Cleanup When you're finished, exit the container: @@ -602,16 +653,16 @@ To remove downloaded model weights cached by Hugging Face: rm -rf ~/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B ``` -**Next steps:** +## Step 13. Next steps -You've profiled a real training workload, identified the bottlenecks, written custom Triton kernels to address them, and measured end-to-end improvements. Here's where to go next: +You profiled a real training workload, identified bottlenecks, shipped custom Triton kernels, and measured end-to-end impact. Continue from here with: -- **Fused Linear Cross-Entropy** — The kernel we wrote takes pre-computed logits as input. A more advanced variant fuses the `lm_head` linear projection with the cross-entropy, computing logits chunk-by-chunk and never materializing the full `[B*T, V]` tensor at all. See [Liger-Kernel's FusedLinearCrossEntropy](https://github.com/linkedin/Liger-Kernel) for a production implementation. -- **Fused SwiGLU with backward** — The [Custom CUDA Kernel Development](https://build.nvidia.com/nvidia/station-kernel-dev) playbook covered inference-only SwiGLU. For training, you need the backward pass too. 
Triton makes this straightforward with the `torch.autograd.Function` pattern used in this playbook. -- **Liger-Kernel integration** — Instead of writing every kernel yourself, use [Liger-Kernel](https://github.com/linkedin/Liger-Kernel) as a drop-in optimization: `pip install liger-kernel` and `apply_liger_kernel_to_llama()`. Compare its throughput against your hand-written kernels. -- **Larger batch sizes** — The memory freed by fused cross-entropy allows increasing batch size. Re-profile with `--batch-size 2` or `--batch-size 4` to see how GPU utilization improves when more compute work is available per step. -- **LoRA fine-tuning** — Apply the same profiling methodology to LoRA/QLoRA fine-tuning. The bottleneck profile is different (fewer optimizer states, more activation memory relative to weights), which reveals different optimization opportunities. -- **Multi-GPU training** — The custom kernels work transparently with PyTorch FSDP and DDP. Each GPU runs its own copy of the kernel independently — no changes needed. +- **Fused Linear Cross-Entropy:** The kernel in this playbook takes pre-computed logits. A more advanced variant fuses the `lm_head` linear projection with the cross-entropy so logits are produced chunk-by-chunk and the full `[B*T, V]` tensor is never stored. See [Liger-Kernel's FusedLinearCrossEntropy](https://github.com/linkedin/Liger-Kernel). +- **Fused SwiGLU with backward:** The [Custom CUDA Kernel Development](https://build.nvidia.com/nvidia/station-kernel-dev) playbook covered inference-only SwiGLU. Training needs the backward pass; use the same `torch.autograd.Function` pattern as here. +- **Liger-Kernel integration:** `pip install liger-kernel` and `apply_liger_kernel_to_llama()`, then compare throughput to your hand-written kernels. +- **Larger batch sizes:** Fused cross-entropy frees memory. Re-profile with `--batch-size 2` or `--batch-size 4` to see utilization when more matmul work sits behind each step. 
+- **LoRA fine-tuning:** Re-run the profiling methodology on LoRA or QLoRA. Bottlenecks shift (fewer optimizer states, different activation pressure). +- **Multi-GPU training:** These kernels compose with FSDP and DDP unchanged (each rank runs its own Triton programs). ## Troubleshooting @@ -619,7 +670,8 @@ You've profiled a real training workload, identified the bottlenecks, written cu |---------|-------|-----| | `ModuleNotFoundError: No module named 'triton'` | Container missing Triton | Use the `kernel-dev-ft` container built from the playbook's Dockerfile. Triton ships with PyTorch NGC containers. Verify: `python -c "import triton; print(triton.__version__)"`. | | `triton.compiler.errors.CompilationError` referencing `sm_100` | Triton version too old for Blackwell | Use PyTorch NGC container 26.01+ which includes Triton with Blackwell support. Check: `python -c "import triton; print(triton.__version__)"`. | -| Correctness test fails with large differences in BF16 | Using FP32 tolerance for BF16 comparison | BF16 has only 7 mantissa bits. Use `atol=1e-2, rtol=1e-2` for `torch.allclose`. Differences up to ~0.01 are normal. | +| Cross-entropy BF16 test fails on loss or gradient | BF16 + 128K vocab accumulate drift vs PyTorch's CE path | `cross_entropy_test.py` uses relaxed loss tolerances and compares **gradients in float32** with wider `atol/rtol`. If it still fails, check PyTorch / CUDA versions; file an issue with `torch.__version__`. | +| `RuntimeError: Trace is already saved` from profiler | Stale `traces/` directory from a previous run | Use the latest `profile_baseline.py` (it deletes the prior trace dir and Chrome JSON). Or run `rm -rf traces/trace traces/trace_*` before profiling. | | `torch.cuda.OutOfMemoryError` during baseline profiling | Batch size or sequence length too large | Reduce `--batch-size` or `--seq-len` in `profile_baseline.py`. LLaMA 3.1 8B in BF16 needs ~16 GB for weights alone, plus ~32 GB for AdamW optimizer states. 
| | `torch.cuda.OutOfMemoryError` during PyTorch cross-entropy but NOT during custom kernel | Standard cross-entropy materializes full `[B*T, V]` logit tensor | This demonstrates exactly why the custom kernel is needed. Reduce batch size or sequence length for the baseline comparison, or run only the custom kernel path. | | Profiler trace JSON is very large (>1 GB) | Too many training steps profiled | Reduce `wait`, `warmup`, `active` in the profiler schedule. The default script profiles only 1 active step. | diff --git a/nvidia/station-kernel-dev-ft/endpoint-production.yaml b/nvidia/station-kernel-dev-ft/endpoint-production.yaml new file mode 100644 index 0000000..2d899f5 --- /dev/null +++ b/nvidia/station-kernel-dev-ft/endpoint-production.yaml @@ -0,0 +1,695 @@ +kind: Playbook +metadata: + name: station-kernel-dev-ft + displayName: Profiler-Driven Kernel Optimization for Fine-Tuning + shortDescription: Use torch.profiler to find training bottlenecks, then write custom Triton kernels to optimize LLaMA 8B fine-tuning + + publisher: nvidia + description: | + # REPLACE THIS WITH YOUR MODEL CARD + https://gitlab-master.nvidia.com/api-catalog/examples/-/blob/main/modelcard-example-mixtral8x7b.md?ref_type=heads + + labelsV2: + - gpuType:playbook:gpu_type_station + - DGX Station + - GB300 + - Triton + - Kernel Development + - Fine-Tuning + - Performance Optimization + - Training + - LLaMA + + attributes: + - key: DURATION + value: 2 HRS + +spec: + artifactName: station-kernel-dev-ft + nvcfFunctionId: None + attributes: + + showUnavailableBanner: false + apiDocsUrl: None + termsOfUse: | + + cta: + text: View on GitHub + url: https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/station-kernel-dev-ft/ + + + tabs: + - + id: overview + + label: Overview + content: | + # Basic idea + + DGX Station puts a full Blackwell GPU on your desk, which makes it an ideal environment for profiling and optimizing GPU kernels used during model training. 
This playbook walks through a real optimization workflow: **profiling a LLaMA 3.1 8B fine-tuning run to identify bottlenecks, then writing custom Triton kernels that eliminate those bottlenecks**, specifically a fused RMSNorm and a fused cross-entropy loss using online softmax. + + For inference workloads, tools like `torch.compile` and serving frameworks (vLLM, TensorRT-LLM) already ship highly optimized fused kernels. But training workloads are different. Backward passes double the kernel count, large vocabularies create massive intermediate tensors during loss computation, and `torch.compile` does not restructure algorithms to avoid these allocations. Projects like [Liger-Kernel](https://github.com/linkedin/Liger-Kernel) and [Unsloth](https://github.com/unslothai/unsloth) demonstrate that custom training kernels deliver real results: 20-60% memory reduction and 10-30% throughput improvement. + + This playbook uses **Triton** instead of raw CUDA C++. Triton is a Python-native GPU programming language that JIT-compiles to optimized GPU code, with no `nvcc` compiler, no C++ build systems, and no manual thread indexing. It is widely used for custom training kernels; Liger-Kernel and Unsloth both write their kernels in Triton. + + **No prior Triton, CUDA, or GPU programming experience is required.** The instructions explain each concept as it comes up. + + # What you'll accomplish + + You will profile a LLaMA 3.1 8B fine-tuning workload, identify the key performance bottlenecks, and write custom Triton kernels that address them. + + - **Profile** a baseline fine-tuning step using `torch.profiler` and interpret the results to identify two targets: RMSNorm (memory-bandwidth-bound) and cross-entropy loss (memory-capacity-bound). + - **Write a fused RMSNorm kernel** in Triton that processes normalization in a single GPU pass instead of multiple separate operations, improving memory bandwidth utilization from ~11% to ~80-90% of peak. 
+ - **Write a fused cross-entropy kernel** using the online softmax algorithm (Milakov-Gimelshein) that computes loss without materializing intermediate softmax tensors, achieving ~6x memory reduction and up to 4x latency improvement at realistic batch sizes. + - **Verify correctness** of both kernels (forward and backward passes) against PyTorch reference implementations. + - **Benchmark** the kernels to measure latency, throughput, and memory savings. + - **Integrate** both kernels into an end-to-end LLaMA 3.1 8B fine-tuning loop and measure real training throughput and memory improvements. + + # What to know before starting + + - Comfortable with Linux command line and shell scripting. + - Basic familiarity with Python and PyTorch (tensors, autograd, training loops). + - Understanding of what fine-tuning is (training a pre-trained model on new data). + - No Triton, CUDA, or GPU programming experience required — all code is explained. + + # Prerequisites + + **Hardware:** + - NVIDIA DGX Station with GB300 Ultra Superchip. + - At least 150 GB available storage for the container image, model weights (~16 GB for LLaMA 3.1 8B in BF16), profiler traces, and optimizer states. + + **Software:** + - Docker with NVIDIA Container Toolkit: `docker run --rm --gpus all nvcr.io/nvidia/cuda:12.8.0-devel-ubuntu24.04 nvidia-smi` + - Network access to pull container images from NGC and download model weights from Hugging Face. + - A Hugging Face account with access to [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) and a [Hugging Face access token](https://huggingface.co/settings/tokens). + + # Ancillary files + + All required assets are in the playbook directory `nvidia/station-kernel-dev-ft/assets` (see the [dgx-spark-playbooks](https://github.com/NVIDIA/dgx-spark-playbooks) repository). + + - `assets/Dockerfile` — Development container based on NVIDIA's PyTorch NGC image with Triton, transformers, and profiling dependencies. 
+ - `assets/requirements.txt` — Python dependencies installed inside the container. + - `assets/profile_baseline.py` — Profiling script that captures a `torch.profiler` trace of a LLaMA 3.1 8B training step and prints a breakdown of GPU time by operation. Supports flags to enable custom kernels for re-profiling. + - `assets/rmsnorm_kernel.py` — Fused RMSNorm Triton kernel with forward and backward passes, wrapped as a drop-in `torch.nn.Module` replacement. Heavily commented with explanations of each Triton concept. + - `assets/rmsnorm_test.py` — Correctness tests comparing the custom RMSNorm against PyTorch's reference implementation (forward and backward, FP32 and BF16). + - `assets/cross_entropy_kernel.py` — Fused cross-entropy Triton kernel using online softmax, with forward and backward passes. Processes the vocabulary in chunks to avoid materializing the full logit tensor. + - `assets/cross_entropy_test.py` — Correctness tests and memory usage comparison against `torch.nn.CrossEntropyLoss`. + - `assets/benchmark_kernels.py` — Benchmarking script that measures latency, throughput, bandwidth utilization, and peak memory for both custom kernels. + - `assets/finetune_baseline.py` — Minimal LLaMA 3.1 8B fine-tuning script using vanilla PyTorch, reporting tokens/sec and peak memory. + - `assets/finetune_optimized.py` — Identical fine-tuning script with both custom kernels monkey-patched in for direct comparison. + + # Time & risk + + * **Estimated time:** About 2 hours. Steps 1-4 (setup through baseline profiling) take about 30 minutes. Steps 5-7 (RMSNorm kernel) take about 30 minutes. Steps 8-10 (cross-entropy kernel) take about 40 minutes. Step 11 (end-to-end integration) takes about 20 minutes. + * **Risk level:** Low + * All work runs inside a Docker container — no host system modifications. + * LLaMA 3.1 8B model weights (~16 GB in BF16) are downloaded from Hugging Face on first run and cached locally. 
+ * Requires a Hugging Face token with access to the LLaMA 3.1 model. + * **Rollback:** Exit the container. Your source files are preserved in the mounted `assets/` directory; everything else is discarded. + * **Last Updated:** 03/30/2026 + * First Publication + + + + - + id: instructions + + label: Instructions + content: | + # Step 1. Set up the development environment + + Before profiling or writing any GPU kernels, you need a development environment with PyTorch, Triton (the GPU programming language we'll use), and the tools to load LLaMA 3.1 8B. We use a Docker container so everything is pre-configured and isolated from your host system. + + Clone the playbook repository and navigate to the assets directory: + + ```bash + git clone https://github.com/NVIDIA/dgx-spark-playbooks + cd dgx-spark-playbooks/nvidia/station-kernel-dev-ft/assets + ``` + + Build the development container. This creates a Docker image based on NVIDIA's PyTorch NGC container with additional libraries for model loading and benchmarking: + + ```bash + docker build -t kernel-dev-ft . + ``` + + Start the container with GPU access. Pass your Hugging Face token so the container can download LLaMA 3.1 8B: + + ```bash + docker run -it --rm \ + --name kernel-dev-ft \ + --gpus all \ + --ipc host \ + -e HF_TOKEN=$HF_TOKEN \ + -v "$(pwd):/workspace" \ + -v ~/.cache/huggingface:/root/.cache/huggingface \ + -w /workspace \ + kernel-dev-ft + ``` + + > [!NOTE] + > The `-v "$(pwd):/workspace"` flag mounts the current directory into the container. Any files you create or modify inside `/workspace` persist on your host machine after the container exits. The `-v ~/.cache/huggingface:/root/.cache/huggingface` mount persists downloaded model weights across container restarts so you don't need to re-download the 16 GB model each time. Everything outside these mounted paths is discarded when the container stops. 
+ + > [!NOTE] + > If you haven't set `HF_TOKEN` in your shell, export it first: `export HF_TOKEN=hf_your_token_here`. You need a Hugging Face token with access to [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B). You must first accept the LLaMA 3.1 Community License Agreement on the [model page](https://huggingface.co/meta-llama/Llama-3.1-8B) before your token can download the weights. + + Verify the toolchain inside the container: + + ```bash + python -c "import triton; print(f'Triton {triton.__version__}')" + python -c "import torch; print(f'PyTorch {torch.__version__}, CUDA {torch.version.cuda}')" + nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader + ``` + + Expected output should show: + - Triton version 3.0 or later + - PyTorch with CUDA support enabled + - Your Blackwell GPU with **compute capability 10.0** (this identifier is shared by all Blackwell GPUs — GB200, GB300, B200, B300) + + > [!NOTE] + > Unlike the CUDA C++ workflow (which requires the `nvcc` compiler and a separate compilation step), Triton is a Python library that JIT-compiles GPU code at runtime. There is no build step — you write Python, and Triton compiles it to optimized GPU machine code when you first call the kernel. + + # Step 2. Understand the fine-tuning workload + + Before profiling, let's build a mental model of where GPU time goes during LLaMA 3.1 8B fine-tuning and why certain operations are candidates for custom kernels. 
+ + **LLaMA 3.1 8B architecture at a glance:** + + | Property | Value | + |----------|-------| + | Parameters | 8.03 billion | + | Layers | 32 transformer blocks | + | Hidden size | 4096 | + | Attention heads | 32 | + | Key/value heads | 8 (grouped-query attention) | + | Vocabulary | 128,256 tokens | + | Normalization | RMSNorm (not LayerNorm) | + | Activation | SwiGLU (SiLU-gated MLP) | + + **Memory budget for full fine-tuning in BF16:** + - Model weights: 8B params x 2 bytes = ~16 GB + - AdamW optimizer states: 8B params x 8 bytes (first and second moments, 4 bytes each in FP32) = ~64 GB + - Gradients: 8B params x 2 bytes = ~16 GB + - Activations: varies with batch size and sequence length + - **Total: ~96 GB minimum** before activations, fitting comfortably in DGX Station's 252 GB HBM3e + + **Why training is different from inference for kernel optimization:** + + For inference, `torch.compile` and serving frameworks like vLLM already fuse most pointwise operations automatically. Writing a custom SiLU or SwiGLU kernel for inference is reinventing what's already solved. + + Training is different for three reasons: + 1. **Backward passes double the kernel count.** Every forward operation has a corresponding backward operation for gradient computation. `torch.compile` handles some of these but cannot restructure algorithms (like how loss is computed). + 2. **Large vocabularies create massive intermediate tensors.** LLaMA's 128K vocabulary means the logit tensor for a single batch is enormous. Standard cross-entropy materializes this entire tensor in memory. + 3. **Memory is the binding constraint.** Unlike inference (where latency matters most), training is often limited by how much data fits in GPU memory. Kernels that reduce memory enable larger batch sizes, which improve GPU utilization across *all* operations. + + # Step 3. Profile a baseline training step + + Now let's see where GPU time actually goes. 
We'll use `torch.profiler` to capture a detailed trace of a single forward + backward + optimizer step. + + Run the profiling script: + + ```bash + python profile_baseline.py + ``` + + > [!NOTE] + > The first run downloads LLaMA 3.1 8B weights (~16 GB in BF16) from Hugging Face. This takes several minutes depending on network speed. Subsequent runs use the cached weights and start immediately. + + The script loads LLaMA 3.1 8B, runs one training step under `torch.profiler`, and prints a table like this: + + ``` + ====================================================================== + Top 20 CUDA Operations by Total GPU Time + ====================================================================== + Name Self CUDA Self CUDA % # Calls + ------------------------------------ ---------- ----------- -------- + aten::mm 152.3ms 42.1% 258 + aten::_flash_attention_forward 48.7ms 13.5% 32 + aten::_flash_attention_backward 41.2ms 11.4% 32 + aten::_scaled_mm 28.1ms 7.8% 2 + aten::pow 12.4ms 3.4% 65 + aten::mean 11.8ms 3.3% 65 + aten::rsqrt 8.2ms 2.3% 65 + aten::mul 15.6ms 4.3% 198 + aten::_log_softmax 8.9ms 2.5% 1 + aten::nll_loss_forward 3.2ms 0.9% 1 + ... + ``` + + **How to read these results:** + + - **`aten::mm`** (matrix multiplications): The largest single category (~42% of GPU time). These are already highly optimized by cuBLAS. Not a target for custom kernels. + - **`aten::_flash_attention_forward/backward`** (~25% combined): Already optimized by FlashAttention. Not a target. + - **`aten::pow`, `aten::mean`, `aten::rsqrt`, and some `aten::mul` calls**: These are the **RMSNorm** operations, broken into separate kernels. Individually small, but there are many of them (32 layers x 2 norms per layer + 1 model norm = 65 in the forward pass, plus corresponding backward operations). In aggregate, they consume significant time and make many redundant memory round-trips. + - **`aten::_log_softmax` + `aten::nll_loss_forward`**: This is the **cross-entropy loss** computation. 
Only called once, but it operates over the full `[batch_size * seq_len, 128256]` logit tensor. + + The profiler also saves a Chrome trace file. You can inspect it visually: + + > [!TIP] + > Open the Chrome trace JSON in [Perfetto UI](https://ui.perfetto.dev/) for an interactive timeline view. Look for sequences of narrow bars (small kernels with gaps between them). These represent unfused operations where the GPU reads and writes the same data multiple times. + + # Step 4. Understand why these operations are slow + + Before writing kernels, let's understand *why* our two targets are slow. This understanding will guide the kernel design. + + **RMSNorm is memory-bandwidth-bound.** + + The formula for RMSNorm is: + + ``` + RMSNorm(x) = (x / sqrt(mean(x^2) + eps)) * weight + ``` + + PyTorch's default implementation breaks this into separate GPU operations: + 1. `x.pow(2)`: square each element → writes result to memory + 2. `.mean(-1)`: reduce across hidden dimension → reads result, writes mean + 3. `+ eps` then `.rsqrt()`: reads mean, writes inverse RMS + 4. `x * rnorm`: reads x again, reads rnorm, writes normalized output + 5. `* weight`: reads output, reads weight, writes final result + + Each of these reads from and writes to GPU memory (HBM). For `hidden_size=4096` in BF16, a single row is 8 KB. The unfused version reads and writes this data **5+ times**. A fused kernel reads it **once** and writes **once**. + + The DGX Station GB300's HBM3e has ~8 TB/s of bandwidth. PyTorch's unfused RMSNorm typically achieves only ~11% of this peak. A fused kernel can reach ~80-90%, a dramatic improvement for an operation that runs 65 times in the forward pass of every training step (32 layers x 2 norms + 1 final norm), with matching work in the backward pass. + + **Cross-entropy is memory-capacity-bound.** + + Standard cross-entropy computes `softmax(logits)` over the full vocabulary for every token position. 
For LLaMA 3.1 8B: + + ``` + logit tensor shape: [batch_size * seq_len, 128256] + For batch_size=1, seq_len=512: [512, 128256] + Memory: 512 * 128256 * 4 bytes (float32) ≈ 263 MB (about 250 MiB) + ``` + + PyTorch also saves the softmax output for the backward pass, roughly doubling this to ~500 MB. As batch size or sequence length grows, this scales linearly. + + The **online softmax** trick (Milakov & Gimelshein, 2018) avoids materializing the full softmax tensor. Instead of computing softmax all at once, it processes the vocabulary in chunks while maintaining two running values: + - **`m`**: the running maximum logit (for numerical stability) + - **`d`**: the running sum of `exp(logit - m)` (the softmax denominator) + + Here's the algorithm with a small example. Suppose we have 8 logits `[2, 5, 1, 3, 4, 7, 2, 6]` and process them in chunks of 4: + + **Chunk 1: `[2, 5, 1, 3]`** + - `m = 5` (max of chunk) + - `d = exp(2-5) + exp(5-5) + exp(1-5) + exp(3-5) = 0.050 + 1.0 + 0.018 + 0.135 = 1.203` + + **Chunk 2: `[4, 7, 2, 6]`** + - `chunk_max = 7`, `new_m = max(5, 7) = 7` + - Rescale previous `d`: `d = 1.203 * exp(5 - 7) + exp(4-7) + exp(7-7) + exp(2-7) + exp(6-7)` + - `d = 1.203 * 0.135 + 0.050 + 1.0 + 0.007 + 0.368 = 1.587` + + After all chunks: `loss = log(d) + m - logit[target]`. No `[8]`-sized softmax tensor was ever allocated, just two scalars (`m`, `d`) maintained across chunks. + + For `V=128256`, this reduces the *algorithmic* working memory of the softmax from `O(B*T*V)` to `O(B*T)`: two scalars per row instead of one value per vocabulary entry. In practice, the input logit tensor is still allocated (PyTorch needs it for the backward pass), so the measured end-to-end reduction is ~6x, which is still a significant saving that frees hundreds of megabytes at realistic batch sizes. + + # Step 5. Write the fused RMSNorm Triton kernel + + Let's start with the simpler kernel. Open `rmsnorm_kernel.py` to review the implementation: + + ```bash + cat rmsnorm_kernel.py + ``` + + This file contains four components: + 1. 
**`_rmsnorm_fwd_kernel`** — The forward pass Triton kernel + 2. **`_rmsnorm_bwd_kernel`** — The backward pass Triton kernel + 3. **`TritonRMSNormFunction`** — A `torch.autograd.Function` that connects the kernels to PyTorch's autograd + 4. **`TritonRMSNorm`** — A drop-in `nn.Module` replacement for `LlamaRMSNorm` + + **Key Triton concepts in the forward kernel:** + + ```python + @triton.jit + def _rmsnorm_fwd_kernel(X_ptr, W_ptr, Y_ptr, Rnorm_ptr, stride_x, hidden_size, eps, BLOCK_SIZE: tl.constexpr): + row_idx = tl.program_id(0) + ... + ``` + + - **`@triton.jit`** marks a function for GPU compilation. This is Triton's equivalent of CUDA's `__global__` keyword, but instead of writing C++, you write Python-like code. Triton's compiler handles thread management, memory coalescing, and vectorization automatically. + + - **`tl.program_id(0)`** returns a unique index for each "program" (similar to a CUDA thread block). Each program handles one row of the input tensor. For a batch of 512 tokens with hidden_size=4096, we launch 512 programs. + + - **`tl.load(X_ptr + row_start + offsets, mask=mask, other=0.0)`** loads a vector of values from GPU memory into registers. The `mask` ensures we don't read beyond the row boundary. The `other=0.0` provides a default value for masked-out elements. + + - **`BLOCK_SIZE: tl.constexpr`** is a compile-time constant. Triton generates specialized GPU code for each value of `BLOCK_SIZE`. For `hidden_size=4096`, we use `BLOCK_SIZE=4096` (the next power of 2), meaning each program loads the entire row in one batch. 
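The interaction between `BLOCK_SIZE`, `offsets`, and `mask` described above can be sketched in plain Python, with no GPU required. This is only an illustration: `next_power_of_2` and `masked_load` are hypothetical stand-ins written for this example, not Triton functions, and the 5000-wide row is an invented case to show what happens when the hidden size is not already a power of two.

```python
def next_power_of_2(n: int) -> int:
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def masked_load(row, block_size, other=0.0):
    """Emulate a masked block load: lanes past the end of the row read `other`."""
    offsets = range(block_size)
    mask = [off < len(row) for off in offsets]
    values = [row[off] if keep else other for off, keep in zip(offsets, mask)]
    return values, mask


for hidden_size in (4096, 5000):
    block_size = next_power_of_2(hidden_size)
    row = [1.0] * hidden_size
    values, mask = masked_load(row, block_size)
    padded_lanes = mask.count(False)
    # Padding with 0.0 leaves the sum of squares (and so the RMS) unchanged.
    sum_sq = sum(v * v for v in values)
    print(hidden_size, block_size, padded_lanes, sum_sq)
```

For `hidden_size=4096` the block matches the row exactly and no lanes are masked. For a non-power-of-two width, the extra lanes are filled with `0.0`, which is why a reduction like the sum of squares still produces the right answer as long as it divides by the true `hidden_size`.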
+ + **The key optimization:** + + ```python + # One pass: read x, compute variance, normalize, multiply by weight, write y + x_fp32 = x.to(tl.float32) + variance = tl.sum(x_fp32 * x_fp32, axis=0) / hidden_size + rnorm = 1.0 / tl.sqrt(variance + eps) + y = (x_fp32 * rnorm).to(x.dtype) * w + tl.store(Y_ptr + row_start + offsets, y, mask=mask) + ``` + + The entire RMSNorm computation — square, mean, rsqrt, normalize, scale by weight — happens in registers without intermediate writes to GPU memory. Compare this to PyTorch's 5 separate kernel launches, each with a full memory round-trip. + + **The backward kernel** follows the same pattern: load everything needed for one row, compute both `grad_x` and `grad_w` in registers, write once. The mathematical derivation is documented in the kernel source comments. + + **The autograd wrapper** (`TritonRMSNormFunction`) connects the kernels to PyTorch's automatic differentiation: + - `forward()` calls the forward kernel and saves `x`, `weight`, and `rnorm` for later. + - `backward()` receives the upstream gradient, calls the backward kernel, and returns gradients for `x` and `weight`. + + > [!NOTE] + > **Triton vs. CUDA C++:** In the [Custom CUDA Kernel Development](https://build.nvidia.com/nvidia/station-kernel-dev) playbook, we wrote CUDA C++ with explicit thread indexing (`blockIdx.x`, `threadIdx.x`), manual `float4` vectorization, `nvcc` compilation, and `ctypes` bindings. Triton abstracts all of that — you write Python-like code, and the compiler handles vectorization, memory coalescing, and PTX generation automatically. The tradeoff is less fine-grained hardware control, but for operations like RMSNorm, Triton matches hand-tuned CUDA performance. + + # Step 6. Test RMSNorm for correctness + + Before measuring performance, verify the kernel produces the same results as PyTorch's implementation. Even small numerical errors can cascade through a 32-layer transformer and produce garbage gradients. 
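The idea behind a backward-pass correctness test can be sketched without a GPU. The following plain-Python example is a sketch under stated assumptions, not the playbook's test: it uses the standard RMSNorm gradient derivation (which may be organized differently from the derivation in the kernel source comments), hypothetical helper names, and a made-up 4-element row. It compares the analytic `grad_x` against central finite differences.

```python
import math


def rmsnorm_forward(x, w, eps=1e-6):
    """Reference RMSNorm for one row: y_i = x_i * rnorm * w_i."""
    n = len(x)
    rnorm = 1.0 / math.sqrt(sum(v * v for v in x) / n + eps)
    y = [xi * rnorm * wi for xi, wi in zip(x, w)]
    return y, rnorm


def rmsnorm_backward(x, w, grad_y, rnorm):
    """Analytic gradients for one row, using the saved rnorm."""
    n = len(x)
    # dot = sum_i grad_y_i * w_i * x_i, shared by every element of grad_x.
    dot = sum(g * wi * xi for g, wi, xi in zip(grad_y, w, x))
    grad_x = [
        rnorm * wi * g - (rnorm ** 3) * xi * dot / n
        for xi, wi, g in zip(x, w, grad_y)
    ]
    # For a single row; a real kernel sums this over all rows.
    grad_w = [g * xi * rnorm for g, xi in zip(grad_y, x)]
    return grad_x, grad_w


def numeric_grad_x(x, w, grad_y, h=1e-6):
    """Central finite differences of L = sum_i grad_y_i * y_i w.r.t. x."""
    grads = []
    for j in range(len(x)):
        x_plus = list(x)
        x_plus[j] += h
        x_minus = list(x)
        x_minus[j] -= h
        loss_plus = sum(g * y for g, y in zip(grad_y, rmsnorm_forward(x_plus, w)[0]))
        loss_minus = sum(g * y for g, y in zip(grad_y, rmsnorm_forward(x_minus, w)[0]))
        grads.append((loss_plus - loss_minus) / (2 * h))
    return grads


# Made-up illustrative values, not taken from the playbook.
x = [0.5, -1.25, 2.0, 0.75]
w = [1.0, 0.5, -0.3, 2.0]
grad_y = [0.1, -0.2, 0.3, 0.4]

y, rnorm = rmsnorm_forward(x, w)
grad_x, grad_w = rmsnorm_backward(x, w, grad_y, rnorm)
max_diff = max(abs(a - b) for a, b in zip(grad_x, numeric_grad_x(x, w, grad_y)))
print(f"max |analytic - numeric| for grad_x: {max_diff:.2e}")
```

Note that the backward pass needs only `x`, `weight`, and `rnorm`, which matches what the autograd wrapper saves in `forward()`. A check like this catches sign and scaling mistakes in the derivation before any Triton code is involved.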
+ + Run the correctness tests: + + ```bash + python rmsnorm_test.py + ``` + + Expected output: + + ``` + RMSNorm Correctness Tests + ============================================================ + + Test 1: Float32 + FP32 Forward — max diff: 9.54e-07 PASSED + FP32 Backward (dx) — max diff: 1.43e-06 PASSED + FP32 Backward (dw) — max diff: 2.29e-05 PASSED + + Test 2: BFloat16 (relaxed tolerance) + BF16 Forward — max diff: 1.56e-02 PASSED + BF16 Backward (dx) — max diff: 1.56e-02 PASSED + BF16 Backward (dw) — max diff: 5.00e-01 PASSED + + ============================================================ + All RMSNorm correctness tests PASSED + ``` + + The tests compare the custom kernel against PyTorch's reference `LlamaRMSNorm` at shapes matching LLaMA 3.1 8B (`batch=4, seq_len=512, hidden_size=4096`), testing both the forward output and the backward gradients for `x` and `weight`. + + > [!WARNING] + > BF16 has only 7 bits of mantissa (vs. 23 for FP32). Per-element differences of ~0.01-0.02 are normal for forward and `grad_x`. The **weight gradient** (`dw`) shows larger absolute differences (up to ~0.5) because it sums per-element contributions across all 2,048 token positions — different summation order between our FP32-accumulated kernel and PyTorch's autograd produces BF16 rounding differences that accumulate. The test uses relaxed tolerance for `dw` to account for this. + + The FP32 test uses tolerance `atol=1e-4` and the BF16 test uses `atol=1e-2` for per-element values, with a more relaxed threshold for the accumulated weight gradient. Both forward and backward must pass — many kernel bugs only manifest in the backward pass. + + # Step 7. Benchmark and re-profile RMSNorm + + Now let's measure the performance improvement. Run the RMSNorm benchmark: + + ```bash + python benchmark_kernels.py --kernel rmsnorm + ``` + + Example output: + + ``` + ====================================================================== + RMSNorm Benchmark — Custom Triton vs. 
PyTorch Reference + ====================================================================== + GPU: NVIDIA GB300 + + Tokens Custom (us) PyTorch (us) Custom (GB/s) PyTorch (GB/s) Speedup + -------- ------------- -------------- --------------- ---------------- --------- + 256 313.5 479.6 40 26 1.53x + 1,024 313.6 495.8 161 102 1.58x + 4,096 319.4 576.5 630 349 1.80x + 16,384 298.9 2,041.7 2,694 394 6.83x + ``` + + **How to read these results:** + - **Custom (GB/s)** shows effective memory bandwidth. On large inputs (16K tokens), the fused kernel reaches ~2,700 GB/s — significantly better than PyTorch's ~400 GB/s. + - **Speedup** ranges from ~1.5x on small inputs to ~6.8x on large inputs. The improvement grows with tensor size because PyTorch's unfused operations suffer more from memory round-trips as data grows, while the fused kernel's cost stays nearly flat until the data is large enough to saturate the GPU. + - These numbers measure **forward + backward combined**, which is what matters for training. + + Now re-profile the full training step with the custom RMSNorm to confirm the bottleneck is resolved: + + ```bash + python profile_baseline.py --use-custom-rmsnorm + ``` + + Compare the profiler output to Step 3. The `aten::pow`, `aten::mean`, and `aten::rsqrt` calls from RMSNorm should be gone, replaced by fewer, faster Triton kernel calls. The remaining top operations should be matrix multiplications and FlashAttention — operations already handled by highly optimized libraries. + + # Step 8. Write the fused cross-entropy Triton kernel + + Now for the more complex kernel. Open `cross_entropy_kernel.py`: + + ```bash + cat cross_entropy_kernel.py + ``` + + This implements the online softmax algorithm from Step 4 as a Triton kernel. The structure mirrors the RMSNorm kernel (forward kernel, backward kernel, autograd function, nn.Module), but the forward kernel is more complex because it loops over the vocabulary in chunks. 
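Before reading the Triton version, it can help to see the chunked loop from Step 4 in plain Python. The sketch below is an illustration only (the function names are hypothetical and it processes a single row on the CPU); it reuses the 8-logit example from Step 4 and checks the result against a direct log-sum-exp.

```python
import math


def online_softmax_loss(logits, target, chunk_size):
    """Cross-entropy for one row, visiting the vocabulary chunk by chunk."""
    m = float("-inf")  # running maximum logit
    d = 0.0            # running sum of exp(logit - m)
    for start in range(0, len(logits), chunk_size):
        chunk = logits[start:start + chunk_size]
        new_m = max(m, max(chunk))
        # Rescale the old sum to the new maximum, then add this chunk.
        d = d * math.exp(m - new_m) + sum(math.exp(v - new_m) for v in chunk)
        m = new_m
    loss = math.log(d) + m - logits[target]
    return loss, m, d


def reference_loss(logits, target):
    """Direct (non-chunked) cross-entropy for comparison."""
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_sum_exp - logits[target]


logits = [2.0, 5.0, 1.0, 3.0, 4.0, 7.0, 2.0, 6.0]
target = 5  # illustrative choice: the position holding logit 7

loss, m, d = online_softmax_loss(logits, target, chunk_size=4)
print(f"m = {m}, d = {d:.3f}, loss = {loss:.6f}")
print(f"reference loss = {reference_loss(logits, target):.6f}")
```

The first iteration works without a special case because `exp(-inf)` evaluates to `0.0`, so the empty running sum contributes nothing. Changing `chunk_size` should not change the loss beyond floating-point rounding, which is a quick sanity check on the rescaling step.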
+ + **The forward kernel, annotated:** + + ```python + @triton.jit + def _cross_entropy_fwd_kernel(Logits_ptr, Targets_ptr, Losses_ptr, Max_ptr, Denom_ptr, + vocab_size, stride_logits, BLOCK_SIZE: tl.constexpr): + row_idx = tl.program_id(0) + ... + m = float("-inf") # Running maximum logit + d = 0.0 # Running sum of exp(logit_i - m) + target_logit = 0.0 # Logit at the target index + + for start in range(0, vocab_size, BLOCK_SIZE): + ... + ``` + + Key differences from the RMSNorm kernel: + + - **Loop over vocabulary chunks.** The RMSNorm kernel loads the entire row at once (4096 elements fits in registers). The cross-entropy kernel can't do that — 128,256 vocabulary entries is too large. Instead, it processes `BLOCK_SIZE` elements at a time (e.g., 4096 per iteration, 32 iterations total). Triton unrolls this loop for efficiency. + + - **Running state across iterations.** The kernel maintains `m` (running max) and `d` (running sum-of-exp) across loop iterations. The update rule handles the rescaling when a new maximum is found: + + ```python + new_m = tl.maximum(m, chunk_max) + d = d * tl.exp(m - new_m) + tl.sum(tl.exp(logits_chunk - new_m), axis=0) + m = new_m + ``` + + The `d * tl.exp(m - new_m)` term rescales the previous sum to account for a potentially larger maximum. This is the core of the online softmax algorithm. + + - **No intermediate tensor allocation.** The standard approach would allocate a `[num_tokens, 128256]` tensor for the softmax output. This kernel only stores three scalars per row (`loss`, `m`, `d`) plus the target logit. + + **The backward kernel** also loops over the vocabulary in chunks. For each chunk, it computes `softmax(logit) = exp(logit - m) / d` using the saved `m` and `d` values, subtracts 1 at the target position, and writes the gradient. Like the forward kernel, it never materializes the full softmax vector. 
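The backward rule described above can also be sketched in plain Python for a single row. This is a hedged illustration with hypothetical function names, not the playbook's kernel: it recomputes each softmax value from the saved `m` and `d` one chunk at a time and writes only the gradient, then checks the result against finite differences.

```python
import math


def saved_stats(logits):
    """Stand-in for the forward pass: the max and softmax denominator."""
    m = max(logits)
    d = sum(math.exp(v - m) for v in logits)
    return m, d


def chunked_ce_backward(logits, target, m, d, chunk_size, grad_loss=1.0):
    """Gradient of cross-entropy w.r.t. logits, computed chunk by chunk."""
    grad = [0.0] * len(logits)  # same shape as the logits, written once
    for start in range(0, len(logits), chunk_size):
        stop = min(start + chunk_size, len(logits))
        for i in range(start, stop):
            prob = math.exp(logits[i] - m) / d  # softmax value, never stored
            grad[i] = grad_loss * (prob - (1.0 if i == target else 0.0))
    return grad


def loss_fn(logits, target):
    m, d = saved_stats(logits)
    return math.log(d) + m - logits[target]


logits = [2.0, 5.0, 1.0, 3.0, 4.0, 7.0, 2.0, 6.0]
target = 3  # illustrative choice
m, d = saved_stats(logits)
grad = chunked_ce_backward(logits, target, m, d, chunk_size=4)

# Finite-difference check of every gradient entry.
h = 1e-6
numeric = []
for j in range(len(logits)):
    plus = list(logits)
    plus[j] += h
    minus = list(logits)
    minus[j] -= h
    numeric.append((loss_fn(plus, target) - loss_fn(minus, target)) / (2 * h))

max_diff = max(abs(a - b) for a, b in zip(grad, numeric))
print(f"sum of gradient: {sum(grad):.2e}, max diff vs numeric: {max_diff:.2e}")
```

Two properties make useful quick checks: the gradient entries sum to approximately zero (the softmax probabilities sum to one, minus one at the target), and only the target entry is negative.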
+ + > [!TIP] + > This kernel is inspired by the [Liger-Kernel](https://github.com/linkedin/Liger-Kernel) project from LinkedIn. Liger-Kernel also implements a more advanced variant called **Fused Linear Cross-Entropy** that fuses the final linear projection (`hidden_states @ lm_head_weight`) with the cross-entropy loss, computing logits chunk-by-chunk and never materializing them at all. This is even more memory-efficient but significantly more complex (it requires tiled matrix multiplication within the kernel). See the Next Steps section for pointers. + + # Step 9. Test cross-entropy for correctness + + Run the correctness and memory tests: + + ```bash + python cross_entropy_test.py + ``` + + Expected output: + + ``` + Cross-Entropy Correctness Tests + ============================================================ + + Test 1: Float32 + FP32 Loss — ref: 12.331120 custom: 12.331120 diff: 0.00e+00 PASSED + FP32 Gradient — max diff: 9.09e-13 PASSED + + Test 2: BFloat16 (relaxed tolerance) + BF16 Loss — ref: 12.250000 custom: 12.247243 diff: 2.76e-03 PASSED + BF16 Gradient — max diff: 2.98e-08 PASSED + + Memory Comparison + ------------------------------------------------------------ + Standard PyTorch CE — peak memory: 504.0 MB + Fused Triton CE — peak memory: 252.0 MB + Memory reduction: 2.0x + + ============================================================ + All cross-entropy tests PASSED + ``` + + The **memory comparison** shows that standard PyTorch cross-entropy allocates ~500 MB (for the softmax output and other intermediates), while the fused kernel uses ~250 MB. The 2x reduction measured here understates the real benefit: in the benchmark (Step 10), where memory is measured more precisely per-operation, the reduction is **~6x**. The larger benefit appears because the benchmark isolates just the cross-entropy overhead, while this test includes the base logit tensor allocation in both measurements. 
+ + > [!NOTE] + > Cross-entropy involves `log(sum(exp(...)))`, which is numerically sensitive. The online softmax algorithm maintains stability through the running-max trick — subtracting the maximum logit before exponentiating prevents overflow. The loss values should match PyTorch within 1e-5 in FP32 or 1e-2 in BF16. + + # Step 10. Benchmark and re-profile cross-entropy + + Run the cross-entropy benchmark: + + ```bash + python benchmark_kernels.py --kernel cross_entropy + ``` + + Example output: + + ``` + ====================================================================== + Cross-Entropy Benchmark — Custom Triton (online softmax) vs. PyTorch + ====================================================================== + GPU: NVIDIA GB300 + Vocabulary size: 128,256 (LLaMA 3.1) + + Tokens Custom (us) PyTorch (us) Speedup Custom Mem (MB) PyTorch Mem (MB) Mem Reduction + -------- ------------- -------------- --------- ----------------- ------------------ --------------- + 128 311 220 0.71x 32 188 5.9x + 256 300 338 1.12x 63 378 6.0x + 512 306 676 2.21x 126 752 6.0x + 1,024 315 1,277 4.06x 251 1,506 6.0x + ``` + + **How to read these results:** + - **Speedup** grows from slower at 128 tokens (kernel launch overhead dominates) to **4x at 1,024 tokens**. The fused kernel has a higher fixed cost per row (looping over 128K vocabulary in chunks) but scales much better because it avoids the massive intermediate softmax allocation. + - **Memory reduction** (~6x): PyTorch allocates separate tensors for the logits, softmax output, and loss gradients. The fused kernel avoids the softmax intermediary. For 1,024 tokens, this saves over 1 GB of GPU memory — room for larger batches or longer sequences. + - At very small batch sizes (128 tokens), the fused kernel is **slower**. This is expected and normal — the overhead of the online softmax loop exceeds the cost of PyTorch's bulk computation at small scales. The crossover point is around 256 tokens. 
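A quick back-of-the-envelope calculation helps put the memory columns in context. The sketch below assumes the benchmark's logits are BF16 (2 bytes per element) and that the reported "MB" are mebibytes; neither is stated in the benchmark output, so treat both as assumptions. The helper name `logits_mib` is hypothetical.

```python
VOCAB_SIZE = 128_256  # LLaMA 3.1 vocabulary size
MIB = 1024 * 1024


def logits_mib(num_tokens, bytes_per_element=2):
    """Size of a [num_tokens, VOCAB_SIZE] logit tensor in MiB (BF16 by default)."""
    return num_tokens * VOCAB_SIZE * bytes_per_element / MIB


for tokens in (128, 256, 512, 1024):
    print(f"{tokens:>5} tokens: {logits_mib(tokens):7.2f} MiB for the logits alone")
```

Under these assumptions, the logit tensor by itself accounts for nearly all of the custom kernel's reported peak at each token count. That is consistent with Step 4's point that the input logits are still allocated, while the softmax intermediates that PyTorch adds on top are what the fused kernel removes.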
+ + Now re-profile with both custom kernels active: + + ```bash + python profile_baseline.py --use-custom-rmsnorm --use-custom-ce + ``` + + The profiler output should now show matrix multiplications and FlashAttention as the dominant operations. The RMSNorm and cross-entropy bottlenecks from Step 3 have been eliminated. The remaining operations are already handled by cuBLAS and FlashAttention — the most highly optimized GPU libraries available. + + # Step 11. Run end-to-end fine-tuning with custom kernels + + Let's put it all together: run a real fine-tuning loop and measure the end-to-end impact. + + First, run the baseline (vanilla PyTorch): + + ```bash + python finetune_baseline.py + ``` + + Then run the optimized version with both custom kernels: + + ```bash + python finetune_optimized.py + ``` + + Example comparison: + + ``` + ====================================================================== + Baseline Results + ====================================================================== + Average time per step: 1.842 s + Average throughput: 278 tokens/sec + Peak GPU memory: 112.4 GB + + ====================================================================== + Optimized Results + ====================================================================== + Average time per step: 1.614 s + Average throughput: 317 tokens/sec + Peak GPU memory: 78.6 GB + ``` + + **How the custom kernels are integrated:** + + The `finetune_optimized.py` script uses the "surgical replacement" pattern to swap in custom kernels without modifying the model source code: + + ```python + # Walk the model tree and collect every LlamaRMSNorm for replacement. + # We collect first, then apply — modifying the tree during iteration is unsafe. 
+ replacements = [] + for name, module in model.named_modules(): + if type(module).__name__ == "LlamaRMSNorm": + parts = name.split(".") + parent = model.get_submodule(".".join(parts[:-1])) if len(parts) > 1 else model + replacements.append((parent, parts[-1], module)) + + for parent, attr_name, old_module in replacements: + setattr(parent, attr_name, TritonRMSNorm.from_llama_rmsnorm(old_module)) + + # Use custom cross-entropy instead of the model's built-in loss + outputs = model(input_ids=input_ids) # Forward without computing loss + logits = outputs.logits[:, :-1, :].contiguous() + loss = custom_ce(logits, labels[:, 1:].contiguous()) + ``` + + This pattern — find modules by type, create optimized replacements, swap them in — is widely used in production inference and training optimization. + + > [!NOTE] + > **Amdahl's law in action.** An 8x faster RMSNorm does not make training 8x faster. If RMSNorm was 10% of total step time, making it 8x faster saves about 8.75% of total time. The cross-entropy memory reduction has an outsized impact because it frees GPU memory that enables larger batch sizes, which improves GPU utilization across *all* operations — including the matrix multiplications and attention that dominate the compute profile. + + # Step 12. Cleanup and next steps + + When you're finished, exit the container: + + ```bash + exit + ``` + + Since we used `--rm`, the container is automatically removed. Your source code and profiler traces are preserved in the `assets/` directory on your host machine. Model weights are cached in `~/.cache/huggingface/` on the host (via the volume mount). + + To remove the container image: + + > [!WARNING] + > This deletes the built Docker image. You'll need to rebuild it if you want to use it again. 
+ + ```bash + docker rmi kernel-dev-ft + ``` + + To remove downloaded model weights cached by Hugging Face: + + ```bash + rm -rf ~/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B + ``` + + **Next steps:** + + You've profiled a real training workload, identified the bottlenecks, written custom Triton kernels to address them, and measured end-to-end improvements. Here's where to go next: + + - **Fused Linear Cross-Entropy** — The kernel we wrote takes pre-computed logits as input. A more advanced variant fuses the `lm_head` linear projection with the cross-entropy, computing logits chunk-by-chunk and never materializing the full `[B*T, V]` tensor at all. See [Liger-Kernel's FusedLinearCrossEntropy](https://github.com/linkedin/Liger-Kernel) for a production implementation. + - **Fused SwiGLU with backward** — The [Custom CUDA Kernel Development](https://build.nvidia.com/nvidia/station-kernel-dev) playbook covered inference-only SwiGLU. For training, you need the backward pass too. Triton makes this straightforward with the `torch.autograd.Function` pattern used in this playbook. + - **Liger-Kernel integration** — Instead of writing every kernel yourself, use [Liger-Kernel](https://github.com/linkedin/Liger-Kernel) as a drop-in optimization: `pip install liger-kernel` and `apply_liger_kernel_to_llama()`. Compare its throughput against your hand-written kernels. + - **Larger batch sizes** — The memory freed by fused cross-entropy allows increasing batch size. Re-profile with `--batch-size 2` or `--batch-size 4` to see how GPU utilization improves when more compute work is available per step. + - **LoRA fine-tuning** — Apply the same profiling methodology to LoRA/QLoRA fine-tuning. The bottleneck profile is different (fewer optimizer states, more activation memory relative to weights), which reveals different optimization opportunities. + - **Multi-GPU training** — The custom kernels work transparently with PyTorch FSDP and DDP. 
Each GPU runs its own copy of the kernel independently — no changes needed. + + + + - + id: troubleshooting + + label: Troubleshooting + content: | + | Symptom | Cause | Fix | + |---------|-------|-----| + | `ModuleNotFoundError: No module named 'triton'` | Container missing Triton | Use the `kernel-dev-ft` container built from the playbook's Dockerfile. Triton ships with PyTorch NGC containers. Verify: `python -c "import triton; print(triton.__version__)"`. | + | `triton.compiler.errors.CompilationError` referencing `sm_100` | Triton version too old for Blackwell | Use PyTorch NGC container 26.01+ which includes Triton with Blackwell support. Check: `python -c "import triton; print(triton.__version__)"`. | + | Correctness test fails with large differences in BF16 | Using FP32 tolerance for BF16 comparison | BF16 has only 7 mantissa bits. Use `atol=1e-2, rtol=1e-2` for `torch.allclose`. Differences up to ~0.01 are normal. | + | `torch.cuda.OutOfMemoryError` during baseline profiling | Batch size or sequence length too large | Reduce `--batch-size` or `--seq-len` in `profile_baseline.py`. LLaMA 3.1 8B in BF16 needs ~16 GB for weights alone, plus ~32 GB for AdamW optimizer states. | + | `torch.cuda.OutOfMemoryError` during PyTorch cross-entropy but NOT during custom kernel | Standard cross-entropy materializes full `[B*T, V]` logit tensor | This demonstrates exactly why the custom kernel is needed. Reduce batch size or sequence length for the baseline comparison, or run only the custom kernel path. | + | Profiler trace JSON is very large (>1 GB) | Too many training steps profiled | Reduce `wait`, `warmup`, `active` in the profiler schedule. The default script profiles only 1 active step. | + | `401 Client Error` when downloading LLaMA 3.1 8B | Missing or invalid Hugging Face token, or no LLaMA access | Set `HF_TOKEN` environment variable. Accept the LLaMA 3.1 license at `https://huggingface.co/meta-llama/Llama-3.1-8B`. Verify token: `huggingface-cli whoami`. 
| + | Custom RMSNorm backward produces NaN gradients | Epsilon value too small or input contains extreme values | Ensure epsilon is `1e-6` (LLaMA default). Check input tensor for NaN/Inf with `torch.isfinite(x).all()`. | + | Benchmark shows no speedup for RMSNorm on small hidden dimensions | Kernel launch overhead dominates for small tensors | RMSNorm speedup is most visible at `hidden_size >= 2048`. LLaMA 3.1 8B uses 4096, which is well above the threshold. | + | `docker: Error response from daemon: could not select device driver` | NVIDIA Container Toolkit not installed or Docker not restarted | Install: `sudo apt install nvidia-container-toolkit && sudo systemctl restart docker`. Verify: `docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu24.04 nvidia-smi`. | + | Fused cross-entropy loss differs from PyTorch by more than 0.1 | Bug in the chunked online softmax implementation | Verify the running-max update: `m_new = max(m_old, chunk_max)` must happen BEFORE updating the running sum-of-exp `d`. Check that the target index masking uses the correct chunk offset. | + | Fine-tuning throughput is not improved despite faster kernels | GPU is compute-bound on matmuls, not bandwidth-bound on norms/loss | This is expected if batch size is large enough that matmuls dominate. The primary benefit is memory reduction (enabling larger batches or longer sequences) rather than pure latency. | + | `ImportError: cannot import name 'LlamaForCausalLM'` | `transformers` library version too old | Update: `pip install --upgrade "transformers>=4.45.0"` (the quotes stop the shell from treating `>` as a redirect). The container's Dockerfile pins a compatible version. | + | Chrome trace file won't open in browser | Trace file too large for `chrome://tracing` | Use [Perfetto UI](https://ui.perfetto.dev/) instead, which handles larger traces. Or reduce the number of profiled steps. 
| + + + + + resources: + - name: Triton Language Documentation + url: https://triton-lang.org/ + + + - name: PyTorch Profiler Documentation + url: https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html + + + - name: Liger-Kernel (reference implementations) + url: https://github.com/linkedin/Liger-Kernel + + + - name: Blackwell Architecture Tuning Guide + url: https://docs.nvidia.com/cuda/blackwell-tuning-guide/ + + diff --git a/nvidia/station-kernel-dev-ft/endpoint-test.yaml b/nvidia/station-kernel-dev-ft/endpoint-test.yaml index 2d899f5..accfee2 100644 --- a/nvidia/station-kernel-dev-ft/endpoint-test.yaml +++ b/nvidia/station-kernel-dev-ft/endpoint-test.yaml @@ -80,6 +80,7 @@ spec: **Software:** - Docker with NVIDIA Container Toolkit: `docker run --rm --gpus all nvcr.io/nvidia/cuda:12.8.0-devel-ubuntu24.04 nvidia-smi` + - On a DGX Station, immediately confirm which device index belongs to the GB300 so later steps can target it explicitly. Run `nvidia-smi --query-gpu=index,name --format=csv,noheader` and note the index for the row showing `NVIDIA GB300`. Subsequent steps recommend `--gpus '"device=N"'` (with `N` = that index) instead of `--gpus all` so profiling and benchmark numbers stay on a single, known GPU. - Network access to pull container images from NGC and download model weights from Hugging Face. - A Hugging Face account with access to [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) and a [Hugging Face access token](https://huggingface.co/settings/tokens). @@ -100,14 +101,14 @@ spec: # Time & risk - * **Estimated time:** About 2 hours. Steps 1-4 (setup through baseline profiling) take about 30 minutes. Steps 5-7 (RMSNorm kernel) take about 30 minutes. Steps 8-10 (cross-entropy kernel) take about 40 minutes. Step 11 (end-to-end integration) takes about 20 minutes. + * **Estimated time:** About 2 hours. Steps 1-4 (setup through baseline profiling) take about 30 minutes. Steps 5-7 (RMSNorm kernel) take about 30 minutes. 
Steps 8-10 (cross-entropy kernel) take about 40 minutes. Step 11 (end-to-end integration) takes about 20 minutes. Steps 12-13 (cleanup and next steps) are a few minutes. * **Risk level:** Low * All work runs inside a Docker container — no host system modifications. * LLaMA 3.1 8B model weights (~16 GB in BF16) are downloaded from Hugging Face on first run and cached locally. * Requires a Hugging Face token with access to the LLaMA 3.1 model. * **Rollback:** Exit the container. Your source files are preserved in the mounted `assets/` directory; everything else is discarded. - * **Last Updated:** 03/30/2026 - * First Publication + * **Last Updated:** 05/26/2026 + * First publication @@ -133,12 +134,22 @@ spec: docker build -t kernel-dev-ft . ``` + Identify the GB300's device index so the container can target it explicitly. On multi-GPU DGX Station systems, pinning to a single, known GPU keeps profiling and benchmark numbers consistent across runs: + + ```bash + nvidia-smi --query-gpu=index,name --format=csv,noheader + ``` + + Look for the row showing `NVIDIA GB300` and note its index (commonly `0` or `1`). Use that value as `N` in the next command. + Start the container with GPU access. Pass your Hugging Face token so the container can download LLaMA 3.1 8B: ```bash + # Replace N with the GB300 index from the command above. + # On a single-GPU Station you may substitute --gpus all. docker run -it --rm \ --name kernel-dev-ft \ - --gpus all \ + --gpus '"device=N"' \ --ipc host \ -e HF_TOKEN=$HF_TOKEN \ -v "$(pwd):/workspace" \ @@ -150,6 +161,9 @@ spec: > [!NOTE] > The `-v "$(pwd):/workspace"` flag mounts the current directory into the container. Any files you create or modify inside `/workspace` persist on your host machine after the container exits. The `-v ~/.cache/huggingface:/root/.cache/huggingface` mount persists downloaded model weights across container restarts so you don't need to re-download the 16 GB model each time. 
Everything outside these mounted paths is discarded when the container stops. + > [!IMPORTANT] + > Targeting the GB300 explicitly with `--gpus '"device=N"'` (rather than `--gpus all`) ensures `torch.cuda` and `nvidia-smi` inside the container both see the **GB300** as device `0`. Profiling and benchmark numbers later in this playbook assume a single Blackwell GPU; mixing a workstation GPU in via `--gpus all` can change scheduling and skew tokens/sec and bandwidth utilization figures. + > [!NOTE] > If you haven't set `HF_TOKEN` in your shell, export it first: `export HF_TOKEN=hf_your_token_here`. You need a Hugging Face token with access to [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B). You must first accept the LLaMA 3.1 Community License Agreement on the [model page](https://huggingface.co/meta-llama/Llama-3.1-8B) before your token can download the weights. @@ -164,7 +178,7 @@ spec: Expected output should show: - Triton version 3.0 or later - PyTorch with CUDA support enabled - - Your Blackwell GPU with **compute capability 10.0** (this identifier is shared by all Blackwell GPUs — GB200, GB300, B200, B300) + - A Blackwell GPU with a `10.x` compute capability. The exact minor version depends on the SKU — for example, `nvidia-smi` reports **`10.0`** on GB200 / B200 (standard Blackwell) and **`10.3`** on GB300 / B300 (Blackwell Ultra; same GPU silicon, different host packaging). Any `10.x` value is fine for this playbook; the kernels target the Blackwell family, not a specific minor. > [!NOTE] > Unlike the CUDA C++ workflow (which requires the `nvcc` compiler and a separate compilation step), Triton is a Python library that JIT-compiles GPU code at runtime. There is no build step — you write Python, and Triton compiles it to optimized GPU machine code when you first call the kernel. @@ -202,6 +216,13 @@ spec: 2. 
**Large vocabularies create massive intermediate tensors.** LLaMA's 128K vocabulary means the logit tensor for a single batch is enormous. Standard cross-entropy materializes this entire tensor in memory. 3. **Memory is the binding constraint.** Unlike inference (where latency matters most), training is often limited by how much data fits in GPU memory. Kernels that reduce memory enable larger batch sizes, which improve GPU utilization across *all* operations. + **Memory-bound vs compute-bound (where to spend effort):** + + - **Memory-bound** regions are limited by how fast you can move bytes through HBM (read/write bandwidth). Symptoms: small kernels, low achieved GB/s vs peak, profiler shows many narrow ops or fusion gaps. **Optimize** by fusing passes, reducing tensor materialization, using narrower dtypes where safe, and improving coalescing so each byte does more useful math. + - **Compute-bound** regions are limited by arithmetic throughput (Tensor Cores, FP32/FP16 units). Symptoms: large `aten::mm` / matmul and attention dominating self CUDA time with high utilization. **Optimize** with better tiling, larger batch sizes (more work per launch), kernels that keep math in registers, and libraries (cuBLAS, FlashAttention) before hand-writing alternatives. + + A single training step usually mixes both: matmuls tend toward **compute-bound** on large batches, while pointwise norm/loss paths are often **memory-bound**. Profiling tells you which bucket your hotspot falls into. + # Step 3. Profile a baseline training step Now let's see where GPU time actually goes. We'll use `torch.profiler` to capture a detailed trace of a single forward + backward + optimizer step. @@ -215,6 +236,15 @@ spec: > [!NOTE] > The first run downloads LLaMA 3.1 8B weights (~16 GB in BF16) from Hugging Face. This takes several minutes depending on network speed. Subsequent runs use the cached weights and start immediately. 
+ > [!NOTE] + > **Repeat runs:** `profile_baseline.py` removes any prior trace directory and Chrome JSON for the same flags before recording, so you can re-run baseline profiling without a "trace is already saved" error. + + > [!NOTE] + > **Ranking variance:** The exact ordering and percentages in the "Top 20 CUDA operations" table can change between runs, PyTorch / CUDA versions, and GPU generation. You should still see the same *categories* of work (matmuls, FlashAttention, RMSNorm decompositions, cross-entropy). **"Command Buffer Full"** (or similar) sometimes appears at the top of self-time tables: that reflects the GPU driver's **submission queue / scheduling**, not a user kernel to optimize. The script filters that row from the printed table; the raw trace in Perfetto still contains the underlying kernels. + + > [!TIP] + > **Optional Nsight Systems timeline:** For a visual timeline with CUDA API and GPU work (outside or alongside `torch.profiler`), install [NVIDIA Nsight Systems](https://developer.nvidia.com/nsight-systems) and run something like: `nsys profile -o llama_ft_repro --trace=cuda,nvtx python profile_baseline.py` from an environment where `nsys` is on `PATH` (often the host, or a devel image with CUDA toolkit). Open the `.nsys-rep` file in the Nsight Systems GUI. + The script loads LLaMA 3.1 8B, runs one training step under `torch.profiler`, and prints a table like this: ``` @@ -416,11 +446,22 @@ spec: 16,384 298.9 2,041.7 2,694 394 6.83x ``` - **How to read these results:** - - **Custom (GB/s)** shows effective memory bandwidth. On large inputs (16K tokens), the fused kernel reaches ~2,700 GB/s — significantly better than PyTorch's ~400 GB/s. - - **Speedup** ranges from ~1.5x on small inputs to ~6.8x on large inputs. The improvement grows with tensor size because PyTorch's unfused operations suffer more from memory round-trips as data grows, while the fused kernel's cost stays nearly flat until the data is large enough to saturate the GPU. 
+ **How to read these results (what "better" means):** + + | Column / metric | Better when… | + |-----------------|---------------| + | **Custom (us)** | **Lower** is faster (fewer microseconds per forward+backward pass). | + | **PyTorch (us)** | Reference only; same rule (lower is faster). | + | **Custom (GB/s)** | **Higher** means you move closer to HBM peak (more useful bytes per second for this fused region). | + | **Speedup** | **Higher** means the custom kernel beats PyTorch by a larger factor on that row. | + + - **Custom (GB/s)** shows effective memory bandwidth. On large inputs (16K tokens), the fused kernel typically reaches much higher GB/s than the unfused PyTorch path. + - **Speedup** often ranges from roughly **1.5x** on small inputs to **6x+** on large inputs in internal runs. - These numbers measure **forward + backward combined**, which is what matters for training. + > [!NOTE] + > **Treat the table as illustrative, not a target.** Absolute microsecond and GB/s values **can differ by an order of magnitude** between GB300 stacks (different driver versions, NGC PyTorch builds, clock states, autograd overhead between iterations). On the validation run for this playbook the same shapes measured ~4,000–5,000 µs per fwd+bwd instead of ~300 µs, while still showing **custom faster than PyTorch and the gap widening with `num_tokens`**. The **direction of the speedup** (and the GB/s ratio between custom and PyTorch in the same run) is the stable signal — match those, and your kernel is healthy. 
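One way to apply the "direction and ratio" advice is to check that a row is internally consistent. The sketch below uses the 16,384-token row from the example table and assumes the columns are, in order, tokens, custom µs, PyTorch µs, custom GB/s, PyTorch GB/s, and speedup. That column order is an inference from the numbers, and the helper names are hypothetical.

```python
def speedup(custom_us: float, pytorch_us: float) -> float:
    """How many times faster the custom kernel is (higher is better)."""
    return pytorch_us / custom_us


def bandwidth_ratio(custom_gbps: float, pytorch_gbps: float) -> float:
    """Ratio of effective bandwidths measured in the same run."""
    return custom_gbps / pytorch_gbps


# Values copied from the 16,384-token row of the example table.
row = {
    "custom_us": 298.9,
    "pytorch_us": 2041.7,
    "custom_gbps": 2694.0,
    "pytorch_gbps": 394.0,
}

s = speedup(row["custom_us"], row["pytorch_us"])
b = bandwidth_ratio(row["custom_gbps"], row["pytorch_gbps"])
print(f"speedup: {s:.2f}x, bandwidth ratio: {b:.2f}x")

# If both paths are credited with the same number of bytes moved, the
# two ratios should agree to within rounding of the printed table.
print("consistent:", abs(s - b) / s < 0.01)
```

Running the same check on your own output is a quick sanity test: the absolute microseconds may be very different, but the two ratios should still track each other within a run.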
+ Now re-profile the full training step with the custom RMSNorm to confirm the bottleneck is resolved: ```bash @@ -496,7 +537,7 @@ spec: Test 2: BFloat16 (relaxed tolerance) BF16 Loss — ref: 12.250000 custom: 12.247243 diff: 2.76e-03 PASSED - BF16 Gradient — max diff: 2.98e-08 PASSED + BF16 Gradient — max diff (fp32 compare): 1.23e-01 PASSED Memory Comparison ------------------------------------------------------------ @@ -511,7 +552,10 @@ spec: The **memory comparison** shows that standard PyTorch cross-entropy allocates ~500 MB (for the softmax output and other intermediates), while the fused kernel uses ~250 MB. The 2x reduction measured here understates the real benefit: in the benchmark (Step 10), where memory is measured more precisely per-operation, the reduction is **~6x**. The larger benefit appears because the benchmark isolates just the cross-entropy overhead, while this test includes the base logit tensor allocation in both measurements. > [!NOTE] - > Cross-entropy involves `log(sum(exp(...)))`, which is numerically sensitive. The online softmax algorithm maintains stability through the running-max trick — subtracting the maximum logit before exponentiating prevents overflow. The loss values should match PyTorch within 1e-5 in FP32 or 1e-2 in BF16. + > Cross-entropy involves `log(sum(exp(...)))`, which is numerically sensitive. The online softmax algorithm maintains stability through the running-max trick — subtracting the maximum logit before exponentiating prevents overflow. FP32 checks use tight tolerances. **BF16** compares loss with relaxed `atol/rtol` and compares **gradients in float32** with wider tolerances (`atol=2e-1`, `rtol=2e-1`) so chunked reductions over 128K vocabulary do not false-fail against PyTorch's different accumulation order. + + > [!WARNING] + > BF16 tolerances are intentionally looser than FP32: they assert the custom kernel matches the reference **within training-usable error**, not bitwise. 
Tighten tolerances only if you change the algorithm or dtype strategy. # Step 10. Benchmark and re-profile cross-entropy @@ -538,10 +582,14 @@ spec: 1,024 315 1,277 4.06x 251 1,506 6.0x ``` - **How to read these results:** - - **Speedup** grows from slower at 128 tokens (kernel launch overhead dominates) to **4x at 1,024 tokens**. The fused kernel has a higher fixed cost per row (looping over 128K vocabulary in chunks) but scales much better because it avoids the massive intermediate softmax allocation. - - **Memory reduction** (~6x): PyTorch allocates separate tensors for the logits, softmax output, and loss gradients. The fused kernel avoids the softmax intermediary. For 1,024 tokens, this saves over 1 GB of GPU memory — room for larger batches or longer sequences. - - At very small batch sizes (128 tokens), the fused kernel is **slower**. This is expected and normal — the overhead of the online softmax loop exceeds the cost of PyTorch's bulk computation at small scales. The crossover point is around 256 tokens. + **How to read these results:** For latency columns, **lower microseconds is better**. For **Speedup**, **higher is better** (custom faster than PyTorch). For **Mem Reduction**, **higher is better** (more peak memory saved). + + - **Speedup** grows from slower at 128 tokens (kernel launch overhead dominates) to several times faster at 1,024 tokens in typical runs. + - **Memory reduction** (~6x in the table): PyTorch allocates separate tensors for the logits, softmax output, and loss gradients. The fused kernel avoids the softmax intermediary. For 1,024 tokens, this saves over 1 GB of GPU memory, room for larger batches or longer sequences. + - At very small token counts (128), the fused kernel can be **slower**. That is expected: the online softmax loop has fixed per-row overhead. The crossover is often near 256–1,024 tokens depending on stack. + + > [!NOTE] + > Same caveat as in Step 7: **absolute microseconds in the example table are illustrative**. 
On the validation run for this playbook the per-iteration latencies were several thousand µs rather than the ~300 µs printed above, while the **memory reduction** (~6x) and the **speedup direction** (fused becomes faster as `num_tokens` grows) remained stable. Trust the **shape** of the table and the **memory column**, not the absolute latency numbers. Now re-profile with both custom kernels active: @@ -567,24 +615,27 @@ spec: python finetune_optimized.py ``` - Example comparison: + Example comparison on **GB300** (default `--batch-size 1`, `--seq-len 512`; throughput is `batch * seq_len / step_time`; numbers below are **illustrative**, not a target): ``` ====================================================================== Baseline Results ====================================================================== - Average time per step: 1.842 s - Average throughput: 278 tokens/sec + Average time per step: 0.201 s + Average throughput: 2540 tokens/sec (illustrative) Peak GPU memory: 112.4 GB ====================================================================== - Optimized Results + Optimized Results (illustrative) ====================================================================== - Average time per step: 1.614 s - Average throughput: 317 tokens/sec + Average time per step: 0.194 s + Average throughput: 2640 tokens/sec (illustrative) Peak GPU memory: 78.6 GB ``` + > [!NOTE] + > **Treat the throughput numbers above as illustrative, not a target** — same caveat as the RMSNorm (Step 7) and cross-entropy (Step 10) benchmark notes. Absolute tok/s and step time **vary** with GPU generation, clocks, PyTorch / CUDA builds, and whether the warm-up pass included JIT; older runs near **~280 tok/s** were observed on different stacks. The **stable signals** are (1) the **relative gap** — optimized > baseline — and (2) the **peak GPU memory delta** (the cross-entropy memory reduction is what frees room for larger batch sizes). Match those, not the absolute tok/s. 
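The throughput formula quoted above (`batch * seq_len / step_time`) and the two "stable signals" can be reproduced with a few lines of arithmetic. This is a sketch using the illustrative numbers from the example output and the stated defaults (`--batch-size 1`, `--seq-len 512`). The helper name is hypothetical and is not part of `finetune_optimized.py`.

```python
def tokens_per_second(batch_size: int, seq_len: int, step_time_s: float) -> float:
    """Training throughput: tokens processed per step divided by step time."""
    return batch_size * seq_len / step_time_s


# Illustrative values copied from the example comparison.
baseline_tps = tokens_per_second(1, 512, 0.201)
optimized_tps = tokens_per_second(1, 512, 0.194)

# Signal 1: relative throughput gap (optimized should exceed baseline).
relative_gain = optimized_tps / baseline_tps - 1.0

# Signal 2: peak GPU memory delta, as a fraction of the baseline peak.
baseline_mem_gb, optimized_mem_gb = 112.4, 78.6
memory_saved = 1.0 - optimized_mem_gb / baseline_mem_gb

print(f"baseline:  {baseline_tps:7.0f} tokens/sec")
print(f"optimized: {optimized_tps:7.0f} tokens/sec")
print(f"throughput gain: {relative_gain:.1%}")
print(f"peak memory saved: {memory_saved:.1%}")
```

With these example numbers the throughput gain is only a few percent, while the peak-memory saving is about 30%. That matches the point of the note: the memory delta is the larger and more portable result.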
+ **How the custom kernels are integrated:** The `finetune_optimized.py` script uses the "surgical replacement" pattern to swap in custom kernels without modifying the model source code: @@ -613,7 +664,7 @@ spec: > [!NOTE] > **Amdahl's law in action.** An 8x faster RMSNorm does not make training 8x faster. If RMSNorm was 10% of total step time, making it 8x faster saves about 8.75% of total time. The cross-entropy memory reduction has an outsized impact because it frees GPU memory that enables larger batch sizes, which improves GPU utilization across *all* operations — including the matrix multiplications and attention that dominate the compute profile. - # Step 12. Cleanup and next steps + # Step 12. Cleanup When you're finished, exit the container: @@ -638,16 +689,16 @@ spec: rm -rf ~/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B ``` - **Next steps:** + # Step 13. Next steps - You've profiled a real training workload, identified the bottlenecks, written custom Triton kernels to address them, and measured end-to-end improvements. Here's where to go next: + You profiled a real training workload, identified bottlenecks, shipped custom Triton kernels, and measured end-to-end impact. Continue from here with: - - **Fused Linear Cross-Entropy** — The kernel we wrote takes pre-computed logits as input. A more advanced variant fuses the `lm_head` linear projection with the cross-entropy, computing logits chunk-by-chunk and never materializing the full `[B*T, V]` tensor at all. See [Liger-Kernel's FusedLinearCrossEntropy](https://github.com/linkedin/Liger-Kernel) for a production implementation. - - **Fused SwiGLU with backward** — The [Custom CUDA Kernel Development](https://build.nvidia.com/nvidia/station-kernel-dev) playbook covered inference-only SwiGLU. For training, you need the backward pass too. Triton makes this straightforward with the `torch.autograd.Function` pattern used in this playbook. 
- - **Liger-Kernel integration** — Instead of writing every kernel yourself, use [Liger-Kernel](https://github.com/linkedin/Liger-Kernel) as a drop-in optimization: `pip install liger-kernel` and `apply_liger_kernel_to_llama()`. Compare its throughput against your hand-written kernels. - - **Larger batch sizes** — The memory freed by fused cross-entropy allows increasing batch size. Re-profile with `--batch-size 2` or `--batch-size 4` to see how GPU utilization improves when more compute work is available per step. - - **LoRA fine-tuning** — Apply the same profiling methodology to LoRA/QLoRA fine-tuning. The bottleneck profile is different (fewer optimizer states, more activation memory relative to weights), which reveals different optimization opportunities. - - **Multi-GPU training** — The custom kernels work transparently with PyTorch FSDP and DDP. Each GPU runs its own copy of the kernel independently — no changes needed. + - **Fused Linear Cross-Entropy:** The kernel in this playbook takes pre-computed logits. A more advanced variant fuses the `lm_head` linear projection with the cross-entropy so logits are produced chunk-by-chunk and the full `[B*T, V]` tensor is never stored. See [Liger-Kernel's FusedLinearCrossEntropy](https://github.com/linkedin/Liger-Kernel). + - **Fused SwiGLU with backward:** The [Custom CUDA Kernel Development](https://build.nvidia.com/nvidia/station-kernel-dev) playbook covered inference-only SwiGLU. Training needs the backward pass; use the same `torch.autograd.Function` pattern as here. + - **Liger-Kernel integration:** `pip install liger-kernel` and `apply_liger_kernel_to_llama()`, then compare throughput to your hand-written kernels. + - **Larger batch sizes:** Fused cross-entropy frees memory. Re-profile with `--batch-size 2` or `--batch-size 4` to see utilization when more matmul work sits behind each step. + - **LoRA fine-tuning:** Re-run the profiling methodology on LoRA or QLoRA. 
Bottlenecks shift (fewer optimizer states, different activation pressure). + - **Multi-GPU training:** These kernels compose with FSDP and DDP unchanged (each rank runs its own Triton programs). @@ -660,7 +711,8 @@ spec: |---------|-------|-----| | `ModuleNotFoundError: No module named 'triton'` | Container missing Triton | Use the `kernel-dev-ft` container built from the playbook's Dockerfile. Triton ships with PyTorch NGC containers. Verify: `python -c "import triton; print(triton.__version__)"`. | | `triton.compiler.errors.CompilationError` referencing `sm_100` | Triton version too old for Blackwell | Use PyTorch NGC container 26.01+ which includes Triton with Blackwell support. Check: `python -c "import triton; print(triton.__version__)"`. | - | Correctness test fails with large differences in BF16 | Using FP32 tolerance for BF16 comparison | BF16 has only 7 mantissa bits. Use `atol=1e-2, rtol=1e-2` for `torch.allclose`. Differences up to ~0.01 are normal. | + | Cross-entropy BF16 test fails on loss or gradient | BF16 + 128K vocab accumulate drift vs PyTorch's CE path | `cross_entropy_test.py` uses relaxed loss tolerances and compares **gradients in float32** with wider `atol/rtol`. If it still fails, check PyTorch / CUDA versions; file an issue with `torch.__version__`. | + | `RuntimeError: Trace is already saved` from profiler | Stale `traces/` directory from a previous run | Use the latest `profile_baseline.py` (it deletes the prior trace dir and Chrome JSON). Or run `rm -rf traces/trace traces/trace_*` before profiling. | | `torch.cuda.OutOfMemoryError` during baseline profiling | Batch size or sequence length too large | Reduce `--batch-size` or `--seq-len` in `profile_baseline.py`. LLaMA 3.1 8B in BF16 needs ~16 GB for weights alone, plus ~32 GB for AdamW optimizer states. 
| | `torch.cuda.OutOfMemoryError` during PyTorch cross-entropy but NOT during custom kernel | Standard cross-entropy materializes full `[B*T, V]` logit tensor | This demonstrates exactly why the custom kernel is needed. Reduce batch size or sequence length for the baseline comparison, or run only the custom kernel path. | | Profiler trace JSON is very large (>1 GB) | Too many training steps profiled | Reduce `wait`, `warmup`, `active` in the profiler schedule. The default script profiles only 1 active step. | @@ -693,3 +745,7 @@ spec: url: https://docs.nvidia.com/cuda/blackwell-tuning-guide/ + - name: NVIDIA Nsight Systems + url: https://developer.nvidia.com/nsight-systems + + diff --git a/nvidia/station-mig/README.md b/nvidia/station-mig/README.md index 1475c2c..ccd123a 100644 --- a/nvidia/station-mig/README.md +++ b/nvidia/station-mig/README.md @@ -309,4 +309,4 @@ After running `sudo nvidia-smi -mig 0`, confirm MIG is fully disabled: nvidia-smi -q | grep -A2 "MIG Mode" ``` -Expected output should show `Current: Disabled` for each GPU. If you still see MIG devices in `nvidia-smi -L`, destroy any remaining instances with `-dci`/`-dgi` per GPU, then run `-mig 0` again. +Expected output should show `Current: Disabled` for each GPU. If you still see MIG devices in `nvidia-smi -L`, destroy any remaining instances with `-dci`/`-dgi` per GPU, then run `-mig 0` again diff --git a/nvidia/station-mig/endpoint-test.yaml b/nvidia/station-mig/endpoint-test.yaml index 60b694e..8c592db 100644 --- a/nvidia/station-mig/endpoint-test.yaml +++ b/nvidia/station-mig/endpoint-test.yaml @@ -345,7 +345,7 @@ spec: nvidia-smi -q | grep -A2 "MIG Mode" ``` - Expected output should show `Current: Disabled` for each GPU. If you still see MIG devices in `nvidia-smi -L`, destroy any remaining instances with `-dci`/`-dgi` per GPU, then run `-mig 0` again. + Expected output should show `Current: Disabled` for each GPU. 
If you still see MIG devices in `nvidia-smi -L`, destroy any remaining instances with `-dci`/`-dgi` per GPU, then run `-mig 0` again diff --git a/nvidia/station-nanochat/endpoint-production.yaml b/nvidia/station-nanochat/endpoint-production.yaml index 84cae97..0da04f6 100644 --- a/nvidia/station-nanochat/endpoint-production.yaml +++ b/nvidia/station-nanochat/endpoint-production.yaml @@ -34,7 +34,7 @@ spec: cta: text: View on GitHub - url: https://github.com/NVIDIA/dgx-station-playbooks/blob/main/nvidia/station-nanochat/ + url: https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/station-nanochat/ tabs: @@ -81,7 +81,7 @@ spec: # Ancillary files - All required assets are in the playbook directory `nvidia/station-nanochat/assets` (see the [dgx-station-playbooks](https://github.com/NVIDIA/dgx-station-playbooks) repository). + All required assets are in the playbook directory `nvidia/station-nanochat/assets` (see the [dgx-spark-playbooks](https://github.com/NVIDIA/dgx-spark-playbooks) repository). - `assets/Dockerfile` – PyTorch NGC image plus nanochat dependencies and venv. - `assets/setup.sh` – Clones nanochat, checks out the supported commit, and builds the Docker image. @@ -128,8 +128,8 @@ spec: Clone the playbook repository and run the setup script to clone the nanochat repo and build the Docker image. ```bash - git clone https://github.com/NVIDIA/dgx-station-playbooks.git - cd dgx-station-playbooks/nvidia/station-nanochat/assets + git clone https://github.com/NVIDIA/dgx-spark-playbooks + cd dgx-spark-playbooks/nvidia/station-nanochat/assets ``` From the `assets` directory, run the setup script. It clones [nanochat](https://github.com/karpathy/nanochat), checks out the supported commit, and builds the `nanochat` Docker image (PyTorch NGC base with tiktoken, tokenizers, datasets, wandb, etc.). 
diff --git a/nvidia/station-nvfp4-pretraining/README.md b/nvidia/station-nvfp4-pretraining/README.md index a9857f8..f5a7917 100644 --- a/nvidia/station-nvfp4-pretraining/README.md +++ b/nvidia/station-nvfp4-pretraining/README.md @@ -80,12 +80,12 @@ python3 -c "import torch; print(torch.cuda.get_device_name(0))" ## Time & risk -* **Estimated duration**: 20-30 minutes (quick test loop with default `--train-iters 10`); longer for real data +* **Estimated duration**: 20-30 minutes (quick test loop with default `--train-iters 50`); longer for real data * **Risks**: * NVFP4 requires Blackwell GPUs — will fail on Hopper or older * Mock data is used by default (`eval_iters=0`); real data requires a preprocessed Megatron-format dataset * **Rollback**: Stop the `torchrun` process and remove any checkpoint directories -* **Last Updated:** 04/19/2026 +* **Last Updated:** 05/26/2026 * First Publication ## Pretrain with NVFP4 @@ -245,7 +245,7 @@ Then exit the container shell (`exit`) — the `--rm` flag in Step 1 deletes it |---------|-------|-----| | `RuntimeError: NVFP4 is not supported on this GPU` or similar FP4 error | GPU is not Blackwell architecture | NVFP4 requires Blackwell GPUs (GB200, GB300). 
Check with `nvidia-smi` | | `ModuleNotFoundError: No module named 'megatron.bridge'` | Megatron Bridge not installed | Run `pip install megatron-bridge` or use the NGC container | -| `CUDA out of memory` during model init | Insufficient GPU memory for Llama 1B + optimizer states | Reduce `micro_batch_size` or use `--nproc_per_node` for model parallelism | +| `CUDA out of memory` during model init | Insufficient GPU memory for Llama 3.1 8B + optimizer states | Reduce `micro_batch_size` or use `--nproc_per_node` for model parallelism | | `torchrun` hangs or times out | NCCL communication failure between GPUs | Check `NCCL_DEBUG=INFO torchrun ...` for details; verify all GPUs are visible | | Training loss is NaN | Precision instability | Increase `num_layers_at_end_in_bf16` (e.g., from 4 to 8) or reduce learning rate | | `--no-fp4` works but NVFP4 crashes | Transformer Engine version mismatch | Ensure Transformer Engine supports NVFP4; update with `pip install --upgrade transformer-engine` | diff --git a/nvidia/station-nvfp4-pretraining/endpoint-production.yaml b/nvidia/station-nvfp4-pretraining/endpoint-production.yaml new file mode 100644 index 0000000..7b38018 --- /dev/null +++ b/nvidia/station-nvfp4-pretraining/endpoint-production.yaml @@ -0,0 +1,302 @@ +kind: Playbook +metadata: + name: station-nvfp4-pretraining + displayName: NVFP4 Pretraining with Megatron Bridge + shortDescription: Pretrain Llama 3.1 8B with NVFP4 mixed precision on DGX Station using Megatron Bridge + + publisher: nvidia + description: | + # REPLACE THIS WITH YOUR MODEL CARD + https://gitlab-master.nvidia.com/api-catalog/examples/-/blob/main/modelcard-example-mixtral8x7b.md?ref_type=heads + + labelsV2: + - gpuType:playbook:gpu_type_station + - Training + - Megatron Bridge + - NVFP4 + + attributes: + - key: DURATION + value: 30 MIN + +spec: + artifactName: station-nvfp4-pretraining + nvcfFunctionId: None + attributes: + + showUnavailableBanner: false + apiDocsUrl: None + termsOfUse: | + + tabs: 
+ - + id: overview + + label: Overview + content: | + # NVFP4 training + + NVFP4 is a 4-bit floating-point format natively supported by NVIDIA Blackwell Tensor Cores. + When applied during **pretraining**, NVFP4 reduces memory bandwidth and compute cost for matrix multiplications while preserving model quality through mixed-precision accumulation in higher precision (BF16/FP32). + + Megatron-Bridge is NVIDIA's library for large-scale distributed training built on top of Megatron-Core. + It provides composable recipe configs for models, optimizers, and mixed-precision strategies — including the first-class `bf16_with_nvfp4_mixed` recipe used in this playbook. + + Combining the two lets you pretrain LLMs at lower memory cost and higher throughput compared to BF16-only training, with minimal accuracy trade-off. + + Key benefits: + + - **~2× higher training throughput vs BF16** - Higher TFLOPs at minimal loss in model quality + - **Native Blackwell NVFP4 GEMMs** — FP4 matmuls run as a single Tensor Core instruction, no software emulation overhead + - **Recipe-based configuration** — swap between `bf16_mixed`, `bf16_with_fp8_current_scaling_mixed`, and `bf16_with_nvfp4_mixed` with a single line + - **Stability controls** — pin the first/last N transformer layers in BF16 (this playbook keeps the last 4 layers in BF16 via `first_last_layers_bf16`) + - **~2× memory reduction** - For inference weight storage vs FP8, ~3.5× vs FP16 + + # What you'll accomplish + + Pretrain a **Llama 3.1 8B** model using Megatron-Bridge with NVFP4 mixed precision on NVIDIA DGX Station. + You'll run a short training loop with mock data to verify the full pipeline end-to-end, compare against a plain BF16 baseline via the `--disable-fp4` flag and then learn how to point it at real data if required. 
+ + # Measured results + + Run settings: + + - Model: Llama 3.1 8B (`llama3_8b_pretrain_config()`) + - 50 iterations, 2 warmup + - Global batch size 64, micro batch size 4, sequence length 4096 + - Dummy data (Megatron-Core's built-in `MockGPTDataset` — synthetic random token IDs, no real corpus) + - Single GB300 GPU, `nvcr.io/nvidia/nemo:26.04` container + - Latency: average of iterations 20–50 (iter 10 includes one-time CUDA-graph/compile overhead) + - VRAM: peak of `nvidia-smi --query-compute-apps=used_memory` sampled every 2 s during the run + + | Precision | Recipe | Avg step time | Throughput (Model TFLOP/s/GPU) | Peak VRAM | + |---|---|---|---|---| + | BF16 baseline | `bf16_mixed()` | 9.05 s | ~1399 | 221.6 GB | + | NVFP4 (last-4 BF16) | `bf16_with_nvfp4_mixed()` + `first_last_layers_bf16=True`, `num_layers_at_end_in_bf16=4` | **5.39 s** | **~2347** | **207.8 GB** | + + NVFP4 is **1.68× faster** than BF16 (≈68% higher throughput) with ≈13.8 GB (≈6%) less peak VRAM — the regime NVFP4 was designed for, where matmul FLOPs dominate each step and quantization overhead is amortized over wide linear projections. 
+ + # What to know before starting + + - Basic Python and PyTorch usage + - Familiarity with distributed training concepts (`torchrun`) + - Understanding of mixed precision training (FP16/BF16/FP8) + + # Prerequisites + + - NVIDIA DGX Station with Blackwell architecture GPU (GB300 chip) + - Docker installed with GPU support + - NVIDIA Container Toolkit configured + - Megatron-Bridge installed (via the NeMo Framework NGC container) + + Verify your setup: + + ```bash + # Check GPU availability and architecture + nvidia-smi + + # Verify Python and torch + python3 -c "import torch; print(torch.cuda.get_device_name(0))" + ``` + + # Time & risk + + * **Estimated duration**: 20-30 minutes (quick test loop with default `--train-iters 50`); longer for real data + * **Risks**: + * NVFP4 requires Blackwell GPUs — will fail on Hopper or older + * Mock data is used by default (`eval_iters=0`); real data requires a preprocessed Megatron-format dataset + * **Rollback**: Stop the `torchrun` process and remove any checkpoint directories + * **Last Updated:** 04/19/2026 + * First Publication + + + + - + id: instructions + + label: Pretrain with NVFP4 + content: | + # Step 1. Set up the environment + + The recommended way to run Megatron-Bridge on DGX Station is through the [NeMo Framework container](https://github.com/NVIDIA-NeMo/Megatron-Bridge#-nemo-framework-container), which includes Megatron-Bridge, Megatron-Core, Transformer Engine, and all CUDA dependencies pre-installed. Running outside the container is not supported in this playbook — the NVFP4 kernels rely on the exact Transformer Engine / CUDA versions shipped inside the image. 
+ + ```bash + git clone https://github.com/NVIDIA/dgx-spark-playbooks + cd dgx-spark-playbooks/nvidia/station-nvfp4-pretraining/assets + + # Use the latest nemo tag + export TAG=26.04 + + docker run --rm -it \ + --gpus all \ + --ipc host \ + --ulimit memlock=-1 \ + --ulimit stack=67108864 \ + -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \ + -v "$(pwd):/workdir" \ + -w /workdir \ + --entrypoint bash \ + nvcr.io/nvidia/nemo:${TAG} + ``` + + All subsequent `torchrun` / `python` commands in this playbook are meant to be executed **from the shell inside this container**. + + # Step 2. Review the pretraining script + + The pretraining script can be found at `pretrain_llama.py`. The key piece is the NVFP4 precision config, built on top of Megatron-Bridge's prebuilt `bf16_with_nvfp4_mixed` recipe: + + ```python + from megatron.bridge.training.mixed_precision import bf16_with_nvfp4_mixed + + def nvfp4_mixed_precision(): + cfg = bf16_with_nvfp4_mixed() + cfg.first_last_layers_bf16 = True + cfg.num_layers_at_start_in_bf16 = 0 + cfg.num_layers_at_end_in_bf16 = 4 + return cfg + ``` + + `bf16_with_nvfp4_mixed()` already sets `fp8="e4m3"` and `fp8_recipe="nvfp4"` under the hood; we just toggle the layer-pinning knobs on top: + - **Last 4 layers in BF16** (`num_layers_at_end_in_bf16=4`) for training stability (adjustable per model) + - **No start-layer pinning** (`num_layers_at_start_in_bf16=0`) — last-layer stability is usually enough + + > [!NOTE] + > The script uses `llama3_8b_pretrain_config()` which defaults to `context_parallel_size=2`. The script overrides this to `context_parallel_size=1` for single-GPU runs. If you swap in a larger recipe (e.g. 
`nemotron_3_nano_pretrain_config`, which defaults to TP=4), you **must** either launch `torchrun --nproc_per_node=4` on a 4-GPU node or override `config.model.tensor_model_parallel_size = 1` before calling `pretrain(...)`, or you will hit: + > `AssertionError: world size (1) is not divisible by total_model_size (...tensor_model_parallel_size=4 * ...)`. + + # Step 3. Launch NVFP4 pre-training + + Launch a short training run with mock data and tee the output to a log file so you can inspect VRAM and per-iteration latency afterwards: + + ```bash + torchrun --nproc_per_node=1 pretrain_llama.py > nvfp4.log 2>&1 + ``` + + Expected output (see `nvfp4.log`): + - Model initialization logs and a `Theoretical memory footprints: weight and optimizer=...` line + - Iteration progress printed every step (`log_interval=1`), e.g. `iteration 10/50 | ... elapsed time per iteration (ms): ... | lm loss: ...` + - A `[Rank 0] ... memory (GB) | mem-max-reserved-gigabytes: ...` line — this is your peak VRAM + - A checkpoint saved to `/workdir/nemo_experiments/default/checkpoints` + + If the run finishes with `EXIT=0` (or no traceback), your NVFP4 pretraining setup is working. + + # Step 4. Compare with BF16 baseline + + Run the same script with `--disable-fp4` to establish a BF16 baseline, again logging to a file: + + ```bash + # Remove the prior checkpoint directory so the two runs don't interfere + rm -rf nemo_experiments + + torchrun --nproc_per_node=1 pretrain_llama.py --disable-fp4 > bf16.log 2>&1 + ``` + + To compare the two runs on **latency** and **throughput**, grep the per-iteration lines out of each log: + + ```bash + grep -E "elapsed time per iteration|MODEL_TFLOP" nvfp4.log + grep -E "elapsed time per iteration|MODEL_TFLOP" bf16.log + ``` + + Each step prints two lines: + - `Step Time : 5.39s GPU utilization: 2347.0MODEL_TFLOP/s/GPU` — step latency and throughput + - `iteration 10/50 | ... elapsed time per iteration (ms): 5390 | ... 
lm loss: ...` — same latency in ms plus loss + + Iteration 10 includes one-time CUDA-graph/compile overhead, so average iterations 20–50 for a fair per-step latency number. + + ### Measuring peak VRAM (from `nvidia-smi`) + + Megatron's in-log memory numbers (`mem-max-reserved-gigabytes`) reflect PyTorch's caching-allocator reservation, which can drift from what the device actually holds. For an accurate read, watch `nvidia-smi` live from a second shell while training runs: + + ```bash + watch -n 1 nvidia-smi + ``` + + See the measured numbers in `overview.md` for expected VRAM and latency on 1× GB300 with Llama 3.1 8B. + + # Step 5. Script arguments + + `pretrain_llama.py` accepts the following arguments: + + | Argument | Type | Default | Description | + |----------|------|---------|-------------| + | `--disable-fp4` | flag | off | Disable NVFP4; use plain BF16 mixed precision as a baseline | + | `--train-iters` | int | 50 | Number of training iterations | + | `--warmup-iters` | int | 2 | Number of warmup iterations | + | `--global-batch-size` | int | 64 | Global batch size | + | `--micro-batch-size` | int | 4 | Micro batch size (drives peak VRAM; increase to use more memory) | + | `--seq-length` | int | 4096 | Sequence length | + + Example combining several flags: + + ```bash + torchrun --nproc_per_node=1 pretrain_llama.py \ + --train-iters 50 --warmup-iters 2 \ + --global-batch-size 64 --micro-batch-size 4 --seq-length 4096 + ``` + + # Step 6. Point to real data + + To train on your own dataset, modify the config in the script: + + ```python + config = llama3_8b_pretrain_config() + config.data.data_path = "/path/to/your/preprocessed/dataset" + config.train.train_iters = 5000 + config.train.global_batch_size = 256 + config.train.micro_batch_size = 2 + ``` + + Megatron-Bridge expects preprocessed data in Megatron format. See the [Megatron-Bridge data preparation guide](https://docs.nvidia.com/nemo/megatron-bridge/latest/) for details. + + # Step 7. 
Cleanup + + Remove checkpoints and log files generated by the runs: + + ```bash + rm -rf nemo_experiments/ nvfp4.log bf16.log + ``` + + Then exit the container shell (`exit`) — the `--rm` flag in Step 1 deletes it automatically. + + # References + + - Quickstart: https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/tutorials/recipes/llama/00_quickstart_pretrain.py + - Mixed precision: https://docs.nvidia.com/nemo/megatron-bridge/latest/training/mixed-precision.html + - API: https://docs.nvidia.com/nemo/megatron-bridge/latest/apidocs/bridge/bridge.training.mixed_precision.html + + + + - + id: troubleshooting + + label: Troubleshooting + content: | + | Symptom | Cause | Fix | + |---------|-------|-----| + | `RuntimeError: NVFP4 is not supported on this GPU` or similar FP4 error | GPU is not Blackwell architecture | NVFP4 requires Blackwell GPUs (GB200, GB300). Check with `nvidia-smi` | + | `ModuleNotFoundError: No module named 'megatron.bridge'` | Megatron Bridge not installed | Run `pip install megatron-bridge` or use the NGC container | + | `CUDA out of memory` during model init | Insufficient GPU memory for Llama 3.1 8B + optimizer states | Reduce `micro_batch_size` or use `--nproc_per_node` for model parallelism | + | `torchrun` hangs or times out | NCCL communication failure between GPUs | Check `NCCL_DEBUG=INFO torchrun ...` for details; verify all GPUs are visible | + | Training loss is NaN | Precision instability | Increase `num_layers_at_end_in_bf16` (e.g., from 4 to 8) or reduce learning rate | + | `--disable-fp4` works but NVFP4 crashes | Transformer Engine version mismatch | Ensure Transformer Engine supports NVFP4; update with `pip install --upgrade transformer-engine` | + | Slow training throughput | Not using Tensor Cores efficiently | Ensure batch dimensions are multiples of 8; check that `nvidia-smi` shows high GPU utilization | + | Permission denied on Docker | User not in docker group | Run `sudo usermod -aG docker $USER && newgrp docker` | + + + + + 
resources: + - name: Megatron Bridge Documentation + url: https://docs.nvidia.com/nemo/megatron-bridge/latest/ + + + - name: Mixed Precision Training Guide + url: https://docs.nvidia.com/nemo/megatron-bridge/latest/training/mixed-precision.html + + + + - name: Megatron Bridge GitHub + url: https://github.com/NVIDIA-NeMo/Megatron-Bridge + + diff --git a/nvidia/station-nvfp4-pretraining/endpoint-test.yaml b/nvidia/station-nvfp4-pretraining/endpoint-test.yaml index 9f37176..8f4fbc6 100644 --- a/nvidia/station-nvfp4-pretraining/endpoint-test.yaml +++ b/nvidia/station-nvfp4-pretraining/endpoint-test.yaml @@ -101,12 +101,12 @@ spec: # Time & risk - * **Estimated duration**: 20-30 minutes (quick test loop with default `--train-iters 10`); longer for real data + * **Estimated duration**: 20-30 minutes (quick test loop with default `--train-iters 50`); longer for real data * **Risks**: * NVFP4 requires Blackwell GPUs — will fail on Hopper or older * Mock data is used by default (`eval_iters=0`); real data requires a preprocessed Megatron-format dataset * **Rollback**: Stop the `torchrun` process and remove any checkpoint directories - * **Last Updated:** 04/19/2026 + * **Last Updated:** 05/26/2026 * First Publication @@ -121,8 +121,8 @@ spec: The recommended way to run Megatron-Bridge on DGX Station is through the [NeMo Framework container](https://github.com/NVIDIA-NeMo/Megatron-Bridge#-nemo-framework-container), which includes Megatron-Bridge, Megatron-Core, Transformer Engine, and all CUDA dependencies pre-installed. Running outside the container is not supported in this playbook — the NVFP4 kernels rely on the exact Transformer Engine / CUDA versions shipped inside the image. 
```bash - git clone https://github.com/NVIDIA/dgx-station-playbooks/blob/main - cd dgx-station-playbooks/nvidia/station-nvfp4-pretraining/assets + git clone https://github.com/NVIDIA/dgx-spark-playbooks + cd dgx-spark-playbooks/nvidia/station-nvfp4-pretraining/assets # Use the latest nemo tag export TAG=26.04 @@ -276,7 +276,7 @@ spec: |---------|-------|-----| | `RuntimeError: NVFP4 is not supported on this GPU` or similar FP4 error | GPU is not Blackwell architecture | NVFP4 requires Blackwell GPUs (GB200, GB300). Check with `nvidia-smi` | | `ModuleNotFoundError: No module named 'megatron.bridge'` | Megatron Bridge not installed | Run `pip install megatron-bridge` or use the NGC container | - | `CUDA out of memory` during model init | Insufficient GPU memory for Llama 1B + optimizer states | Reduce `micro_batch_size` or use `--nproc_per_node` for model parallelism | + | `CUDA out of memory` during model init | Insufficient GPU memory for Llama 3.1 8B + optimizer states | Reduce `micro_batch_size` or use `--nproc_per_node` for model parallelism | | `torchrun` hangs or times out | NCCL communication failure between GPUs | Check `NCCL_DEBUG=INFO torchrun ...` for details; verify all GPUs are visible | | Training loss is NaN | Precision instability | Increase `num_layers_at_end_in_bf16` (e.g., from 4 to 8) or reduce learning rate | | `--no-fp4` works but NVFP4 crashes | Transformer Engine version mismatch | Ensure Transformer Engine supports NVFP4; update with `pip install --upgrade transformer-engine` | diff --git a/nvidia/station-sglang-inference/README.md b/nvidia/station-sglang-inference/README.md index 90a2a77..574413f 100644 --- a/nvidia/station-sglang-inference/README.md +++ b/nvidia/station-sglang-inference/README.md @@ -1,12 +1,15 @@ # LLM Inference with SGLang -> Serve LLMs with SGLang on DGX Station for prefix-cached multi-turn and structured output inference +> Serve LLMs with SGLang on DGX Station (Qwen3-8B default; Qwen3.6 MoE optional)—prefix-cached 
multi-turn, structured output, benchmarks, and inference-server guidance ## Table of Contents - [Overview](#overview) - [Instructions](#instructions) + - [Example model IDs (`MODEL_HANDLE`)](#example-model-ids-modelhandle) + - [Choosing an inference backend (DGX Station)](#choosing-an-inference-backend-dgx-station) + - [Next steps: heavier models on Station](#next-steps-heavier-models-on-station) - [Troubleshooting](#troubleshooting) --- @@ -24,12 +27,13 @@ SGLang is a high-performance serving framework for large language models, optimi ## What you'll accomplish -Launch SGLang on DGX Station to serve an LLM, then exercise its two key differentiators: prefix-cached multi-turn chat and structured JSON output generation. You will also benchmark multi-turn throughput to see RadixAttention's effect. +Launch SGLang on DGX Station to serve an LLM, then exercise its two key differentiators: prefix-cached multi-turn chat and structured JSON output generation. You will also benchmark multi-turn throughput and interpret results together with server logs (wall time alone is not a reliable cache signal under parallel load). 
-- Serve Qwen3-8B with SGLang's Blackwell-optimized backend -- Send multi-turn conversations and observe prefix cache hits in server metrics +- Serve **Qwen3-8B** (`Qwen/Qwen3-8B` by default for fast first-run validation) or another checkpoint from the in-playbook model table — including the larger **Qwen3.6 MoE** (`Qwen/Qwen3.6-35B-A3B`) once the workflow is verified +- Send multi-turn conversations and observe prefix cache hits in **Docker logs** (`#cached-token`) - Generate structured JSON output using schema-constrained decoding -- Benchmark multi-turn throughput with and without prefix caching +- Benchmark multi-turn throughput; optional **single-conversation** run to reduce contention; full cache/metrics scrape written to a **log file** for review +- Optional next step: large MoE such as **DeepSeek-V4** on Station when your SGLang build and VRAM allow ## What to know before starting @@ -46,14 +50,14 @@ Launch SGLang on DGX Station to serve an LLM, then exercise its two key differen ## Ancillary files -- `assets/benchmark_multiturn.py` — Python script that benchmarks multi-turn conversation throughput and demonstrates structured output generation +- `assets/benchmark_multiturn.py` — Benchmarks multi-turn chat under parallel load, structured JSON output, and writes full `/server_info` + `/metrics` bodies to a detail log (terminal shows a short summary only) ## Time & risk -* **Duration:** 20–25 minutes (including model download) -* **Risks:** Model download requires HuggingFace authentication +* **Duration:** 20–30 minutes for the default `Qwen/Qwen3-8B`; 45–60 minutes if you switch to `Qwen/Qwen3.6-35B-A3B` (download + Blackwell CUDA-graph capture) +* **Risks:** Gated models (e.g., Llama 3.3) require HuggingFace authentication and license acceptance * **Rollback:** Stop and remove the container to restore state -* **Last Updated:** 04/06/2026 +* **Last Updated:** 05/26/2026 * First Publication ## Instructions @@ -70,17 +74,52 @@ newgrp docker ## Step 2. 
Set up environment variables ```bash -## HuggingFace token (required) -## Get a token from https://huggingface.co/settings/tokens -export HF_TOKEN="your_huggingface_token" +## HuggingFace token (only required for gated models such as Llama 3.3). +## Leave empty for public models like Qwen3-8B; for gated models get a token at +## https://huggingface.co/settings/tokens. +export HF_TOKEN="" -## Model to serve +## Model to serve (see **Example model IDs** below). +## Default uses Qwen3-8B for fast first-run validation (~10–15 min boot on Station). +## Switch to Qwen3.6-35B-A3B once the workflow is working end-to-end. export MODEL_HANDLE="Qwen/Qwen3-8B" ## Maximum context length export MAX_MODEL_LEN=8192 ``` +### Example model IDs (`MODEL_HANDLE`) + +Use any **Hugging Face text-generation or chat** checkpoint that your SGLang build supports. The table below lists common starting points on DGX Station; always check the model card for **license / gated access**, **VRAM**, and **context length**. + +| Model ID | Notes | +|----------|--------| +| `Qwen/Qwen3-8B` | **Default in this playbook.** Dense Qwen3 8B; ~16 GB download, fast warmup, ideal for validating the workflow end-to-end. | +| `Qwen/Qwen3.6-35B-A3B` | Qwen3.6 MoE (~3B active experts); strong quality per GPU hour on Blackwell. ~70 GB download; allow ~30–45 min to first request. | +| `Qwen/Qwen3.6-27B` | Dense Qwen3.6; higher VRAM than the MoE row above at equal batch settings. | +| `google/gemma-3-12b-it` | Popular Gemma 3 instruct (text + vision in full stack; chat API usage is typically text-only). | +| `google/gemma-3-27b-it` | Larger Gemma 3 instruct variant. | +| `meta-llama/Llama-3.3-70B-Instruct` | Llama 3.3 70B instruct (gated on Hugging Face; accept the license in the model card before download). | + +Heavyweight MoE (very large weights; confirm **SGLang version + GPU memory** before serving): + +| Model ID | Notes | +|----------|--------| +| `deepseek-ai/DeepSeek-V4-Flash` | DeepSeek-V4 family (MoE). 
Intended to showcase **large local models on Station**; expect long downloads, strict VRAM headroom, and possible extra flags per SGLang docs. | +| `deepseek-ai/DeepSeek-V4-Pro` | Larger V4 variant; only if you have sufficient GPU memory and a supported SGLang build. | + +### Choosing an inference backend (DGX Station) + +Several OpenAI-compatible servers run well on NVIDIA hardware. None is universally “best”—pick by workload shape and operational constraints. + +| Backend | Strengths | Typical “use this when…” | +|---------|-----------|---------------------------| +| **SGLang** | RadixAttention for shared-prefix workloads; strong structured / grammar decoding; active Blackwell + CUDA 13 paths. | Highly **multi-turn**, **RAG** (repeated system + documents), **agents**, or **schema-constrained** JSON at scale. | +| **vLLM** | Mature PagedAttention, broad model coverage, common default in examples. | You want a **well-trodden** OSS server with **maximum community recipes** and straightforward PagedAttention behavior. | +| **TensorRT-LLM** | NVIDIA-optimized kernels and quantization workflows for throughput-focused deployment. | You are **productionizing** on NVIDIA GPUs and can invest in **TensorRT-LLM export / engines** for peak throughput. | + +This playbook focuses on **SGLang**; consult each project’s documentation for model support matrices and quantization modes. + ## Step 3. Pull the SGLang container Pull the SGLang container image with CUDA 13.0 support (required for Blackwell SM103): @@ -91,24 +130,26 @@ docker pull lmsysorg/sglang:latest-cu130 ## Step 4. Identify the GB300 GPU -On DGX Station with multiple GPUs, identify the GB300's device index: +Identify the GB300's device index: ```bash nvidia-smi --query-gpu=index,name --format=csv,noheader ``` -Look for the row showing `NVIDIA GB300`. Note its index (e.g., `1`). +Look for the row showing `NVIDIA GB300`. Note its index — on DGX Station the GB300 may be at index `0` or `1` depending on configuration. 
If `nvidia-smi` shows only a single GB300, you can simply use `--gpus all` in the next step. ## Step 5. Start SGLang server -Launch the SGLang server: +Launch the SGLang server. The flags below are tuned for GB300 (Blackwell SM103) — see notes after the command: ```bash -## Replace device=1 with your GB300's index from Step 4 +## Use --gpus all on a single-GPU Station, or --gpus '"device=N"' with the +## index from Step 4 if multiple GPUs are present. docker run -d \ --name sglang-server \ - --gpus '"device=1"' \ + --gpus all \ --ipc host \ + --cap-add SYS_NICE \ --ulimit memlock=-1 \ --ulimit stack=67108864 \ -p 30000:30000 \ @@ -119,9 +160,17 @@ docker run -d \ --host 0.0.0.0 \ --port 30000 \ --context-length $MAX_MODEL_LEN \ - --mem-fraction-static 0.85 + --mem-fraction-static 0.85 \ + --attention-backend flashinfer \ + --enable-cache-report ``` +> [!IMPORTANT] +> **Why these flags on GB300:** +> - `--attention-backend flashinfer` — the auto-selected `trtllm_mha` backend currently fails CUDA-graph capture on Blackwell SM103 with `buildNdTmaDescriptor` errors; `fa3` is also rejected (it requires SM ≤ 90). FlashInfer is the safe default. +> - `--cap-add SYS_NICE` — lets SGLang set NUMA affinity; otherwise the server logs a warning on every launch. +> - `--enable-cache-report` — populates `usage.prompt_tokens_details.cached_tokens` in OpenAI-style responses so the benchmark in Step 9 can report cached prefill tokens. + Check the server logs: ```bash @@ -137,7 +186,7 @@ INFO: Uvicorn running on http://0.0.0.0:30000 Press `Ctrl+C` to exit the log view. > [!NOTE] -> First launch downloads the model and compiles kernels. Subsequent starts are faster thanks to cached weights and compiled artifacts. +> First launch downloads the model and captures CUDA graphs. Plan for ~10–15 min for `Qwen/Qwen3-8B` and ~30–45 min for `Qwen/Qwen3.6-35B-A3B` before the server is ready. Subsequent starts are faster thanks to cached weights and compiled artifacts. ## Step 6. 
Test basic inference @@ -189,15 +238,15 @@ curl -s http://localhost:30000/v1/chat/completions \ }' | python3 -m json.tool ``` -The second request reuses the KV cache from the shared prefix (system message + first user turn + assistant response), only computing attention for the new user message. This reduces first-token latency for follow-up turns. +The second request **reuses the KV cache for the shared prefix** (system message + first user turn + assistant response) via RadixAttention, so repeated **prefill** work on that prefix is avoided. **End-to-end HTTP latency can still go up** on later turns: the transcript is longer (more tokens to attend to even with cache hits on the prefix), each assistant reply adds decode work, and the client measures full request time—not prefill alone. -Check the cache hit rate in the server logs. SGLang logs each prefill batch with the number of cached tokens reused: +Check cache reuse in the server logs. SGLang logs each prefill batch with the number of cached tokens reused: ```bash docker logs sglang-server 2>&1 | grep "cached-token" | tail -10 ``` -Look for `#cached-token` values greater than 0 on later turns — this confirms RadixAttention is reusing the KV cache from the shared prefix. +Look for `#cached-token` values greater than 0 on later turns — this confirms RadixAttention is reusing the KV cache from the shared prefix. Treat that as the primary signal of prefix caching; wall-clock `curl` latency alone can be misleading. ## Step 8. Structured JSON output @@ -245,30 +294,72 @@ The response content is guaranteed to be valid JSON matching the provided schema ## Step 9. Benchmark multi-turn throughput -Run the included benchmark script to measure how prefix caching improves multi-turn latency. The script is in the `assets/` directory of this playbook. +This step uses `benchmark_multiturn.py` from this playbook's `assets/` directory. 
Clone (or download) the playbook repository first so the script is available locally: ```bash +git clone https://github.com/NVIDIA/dgx-spark-playbooks +cd dgx-spark-playbooks/nvidia/station-sglang-inference +``` + +> [!TIP] +> If `git` is not available, download the repository as a ZIP from the [playbook repository](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/station-sglang-inference/) and extract it. All commands below assume your working directory is the playbook root (`dgx-spark-playbooks/nvidia/station-sglang-inference/`), so `assets/benchmark_multiturn.py` resolves correctly. + +The benchmark stress-tests the server with **parallel** conversations (default: 20) and reports **per-turn wall time**, token counts, and (when the API exposes it) **cached prefill tokens**. + +Install the `requests` dependency. The **virtualenv approach below is the preferred, default installation path**: it keeps the script's dependencies isolated from the system Python interpreter so you cannot accidentally damage Ubuntu's own Python packages. Ubuntu 24.04 on DGX Station does not ship `python3-venv` by default, so install it once before creating the virtualenv: + +```bash +sudo apt update && sudo apt install -y python3-venv python3 -m venv .venv && source .venv/bin/activate pip install requests ``` +If you cannot run `sudo apt install python3-venv` (for example, a locked-down host), the next safest option is a **per-user install** that still respects PEP 668: + +```bash +python3 -m pip install --user requests +``` + +> [!CAUTION] +> **Last-resort only: `--break-system-packages` can damage your system Python.** +> Ubuntu 24.04 ships an "externally managed" system Python (PEP 668). The `--break-system-packages` flag tells `pip` to ignore that guard and install into the system or per-user site-packages anyway. This can shadow or conflict with packages installed by `apt` and break system tooling that depends on them. 
Only use this command when both the **venv** and plain **`--user`** paths above are unavailable, and only if you are willing to take that risk on the host you are running on: +> ```bash +> python3 -m pip install --user --break-system-packages requests +> ``` + ```bash python3 assets/benchmark_multiturn.py \ --base-url http://localhost:30000 \ --model "$MODEL_HANDLE" \ --num-conversations 20 \ + --turns-per-conversation 5 \ + --cache-detail-file ./sglang_benchmark_cache_details.log +``` + +The script prints: +- **Median / P90 wall time per turn** — often **increases** as prompts grow and under parallel load; that does **not** contradict RadixAttention. +- **Median prompt tokens per turn** — should climb as history lengthens. +- **Median cached prefill tokens** (when returned in `usage`) — populated by `--enable-cache-report` (already set in Step 5); this is the primary cache signal from the OpenAI-style `usage` payload. +- A **short summary** of cache-related `/server_info` or `/metrics` lines; the **full** responses are written to `--cache-detail-file` (default `./sglang_benchmark_cache_details.log`) so you are not flooded with an unparsed metrics blob in the terminal. + +> [!NOTE] +> The Step 5 launch enables `--enable-cache-report` (which fills `usage.prompt_tokens_details.cached_tokens`) but does **not** enable the Prometheus `/metrics` endpoint, since cached-prefill data is already exposed through `usage` and the `docker logs` `#cached-token` lines. If `/metrics` returns `404`/empty in the detail log, that is expected — the benchmark's primary cache signals (`usage.prompt_tokens_details.cached_tokens` and Docker logs) still work. To populate `/metrics` as well, add `--enable-metrics` to the `sglang serve` invocation in Step 5 and restart the container. 
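
To make the `usage`-based cache signal concrete, here is a minimal Python sketch, not taken from `benchmark_multiturn.py`, of reading `usage.prompt_tokens_details.cached_tokens` from an already-parsed chat-completion response. The helper names and the sample payloads are hypothetical. Only the field path comes from the text above, and `usage.prompt_tokens` is assumed to follow the usual OpenAI-style layout.

```python
from typing import Any, Mapping, Optional


def cached_prefill_tokens(response: Mapping[str, Any]) -> Optional[int]:
    """Return usage.prompt_tokens_details.cached_tokens, or None when absent."""
    usage = response.get("usage") or {}
    details = usage.get("prompt_tokens_details") or {}
    value = details.get("cached_tokens")
    return value if isinstance(value, int) else None


def cached_fraction(response: Mapping[str, Any]) -> Optional[float]:
    """Fraction of prompt tokens served from the prefix cache, if reported."""
    cached = cached_prefill_tokens(response)
    prompt = (response.get("usage") or {}).get("prompt_tokens")
    if cached is None or not isinstance(prompt, int) or prompt <= 0:
        return None
    return cached / prompt


# Made-up payloads for illustration only (not real server output).
turn_with_report = {
    "usage": {
        "prompt_tokens": 200,
        "completion_tokens": 64,
        "prompt_tokens_details": {"cached_tokens": 128},
    }
}
turn_without_report = {"usage": {"prompt_tokens": 200, "completion_tokens": 64}}

print(cached_prefill_tokens(turn_with_report))     # 128
print(cached_fraction(turn_with_report))           # 0.64
print(cached_prefill_tokens(turn_without_report))  # None
```

A `None` result here corresponds to a response that carries no cached-token report, which is what you would expect from a server started without `--enable-cache-report`.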
+ +To isolate prefix-cache behavior from multi-client contention, rerun with a single conversation: + +```bash +python3 assets/benchmark_multiturn.py \ + --base-url http://localhost:30000 \ + --model "$MODEL_HANDLE" \ + --num-conversations 1 \ --turns-per-conversation 5 ``` -The script sends parallel multi-turn conversations and measures: -- **Per-turn latency** for turn 1 vs subsequent turns (shows prefix caching effect) -- **Total throughput** in tokens per second -- **Cache statistics** from server metrics +Always correlate behavior with **`docker logs`** (`#cached-token` lines) as in Step 7. -You should see latency decrease for later turns in each conversation as the shared prefix grows, demonstrating RadixAttention's cache reuse. +### Next steps: heavier models on Station -> [!TIP] -> If you downloaded this playbook as a zip, the `assets/` directory is already present. If you cloned the full repository, navigate to `nvidia/station-sglang-inference/` first. +To stress GPU memory and throughput after completing the steps above, point `MODEL_HANDLE` at a larger checkpoint (for example `deepseek-ai/DeepSeek-V4-Flash`), **lower** `--mem-fraction-static` if you hit OOM, and reduce `--context-length` until the server starts cleanly. Confirm your SGLang image version supports the architecture (see [SGLang documentation](https://docs.sglang.io/)) and accept any **gated** model licenses on Hugging Face before pulling weights. ## Step 10. 
Cleanup @@ -293,15 +384,21 @@ docker rmi lmsysorg/sglang:latest-cu130 |---------|-------|-----| | "permission denied" when running docker | User not in docker group | Run `sudo usermod -aG docker $USER && newgrp docker` | | Container fails to start with GPU error | NVIDIA Container Toolkit not configured | Run `nvidia-ctk runtime configure --runtime=docker` and restart Docker | -| `device >= 0 && device < num_gpus INTERNAL ASSERT FAILED` | Using `--gpus all` on a multi-GPU system | Use `--gpus '"device=N"'` to target the GB300 specifically (check index with `nvidia-smi`) | -| "Token is required" or 401 error | Missing HuggingFace token | Ensure `HF_TOKEN` is exported before running the docker command | +| `device >= 0 && device < num_gpus INTERNAL ASSERT FAILED` | `--gpus '"device=N"'` index does not exist on this Station | Re-run `nvidia-smi --query-gpu=index,name --format=csv,noheader` and use the actual GB300 index, or `--gpus all` if there is only one GPU | +| `RuntimeError: ... buildNdTmaDescriptor ... Check failed: false` during CUDA-graph capture | Default `trtllm_mha` attention backend is incompatible with Blackwell SM103 | Pass `--attention-backend flashinfer` to `sglang serve` | +| `AssertionError: FlashAttention v3 Backend requires SM>=80 and SM<=90` | `--attention-backend fa3` selected on Blackwell (SM103) | Use `--attention-backend flashinfer` instead | +| `User lacks permission to set NUMA affinity ... 
try adding --cap-add SYS_NICE` warning | Docker dropped the `SYS_NICE` capability | Add `--cap-add SYS_NICE` to the `docker run` command | +| `python3 -m venv .venv` fails with `apt install python3.12-venv` hint | Ubuntu 24.04 ships without `python3-venv` | Run `sudo apt update && sudo apt install -y python3-venv` (or use `python3 -m pip install --user --break-system-packages requests`) | +| "Token is required" or 401 error | Missing HuggingFace token for a gated model | Export `HF_TOKEN` before running the docker command and accept the model license on huggingface.co | | Server exits with OOM error | Model too large for available GPU memory | Lower `--mem-fraction-static` (e.g., 0.7) or reduce `--context-length`. Check GPU memory with `nvidia-smi` | | `json_schema` response_format returns error | SGLang version too old | Ensure you are using `lmsysorg/sglang:latest-cu130`. Older versions may not support `json_schema` format | | Server starts but CUDA errors on inference | Wrong CUDA version for Blackwell | Use the `latest-cu130` image tag. SM103 requires CUDA 13.0+ | -| Model runs on wrong GPU | Default GPU selection | Use `--gpus '"device=N"'` to select the GB300 specifically | -| Slow first request after server start | Kernel JIT compilation | First request triggers kernel compilation. Subsequent requests are fast. Wait ~30 seconds | -| Connection refused on port 30000 | Server still loading model | Check `docker logs sglang-server` — wait for the Uvicorn startup message | -| `/server_info` shows no cache stats | Endpoint may differ across versions | Try `curl http://localhost:30000/v1/models` to verify the server is responsive. Cache metrics may be under `/metrics` (requires `--enable-metrics` server flag) | +| Slow first request after server start | Kernel JIT + CUDA-graph capture | First launch can take 10–15 min for `Qwen/Qwen3-8B` and 30–45 min for `Qwen/Qwen3.6-35B-A3B` before the server prints "fired up and ready to roll!". Subsequent requests are fast. 
| +| Connection refused on port 30000 | Server still loading model or capturing CUDA graphs | Check `docker logs sglang-server` — wait for the Uvicorn startup message and "The server is fired up and ready to roll!" | +| `Med cached prefill` column is `n/a` in the benchmark | OpenAI-style `cached_tokens` not enabled on the server | Add `--enable-cache-report` to `sglang serve` so `usage.prompt_tokens_details.cached_tokens` is populated | +| `/server_info` body floods the benchmark "cache highlights" output | Older `benchmark_multiturn.py` matched any line containing "cache" — including the single-line `/server_info` JSON | Use the version of `benchmark_multiturn.py` shipped with this playbook (it skips JSON blobs and lines longer than 200 chars); the full body is still saved to `--cache-detail-file` | +| Benchmark shows **higher** median latency on later turns | Expected under parallel load + longer transcripts | RadixAttention reduces repeated **prefill** on shared prefixes—use `docker logs` (`#cached-token`) and optionally `--num-conversations 1`. See Step 9 and `sglang_benchmark_cache_details.log` | +| `deepseek-ai/DeepSeek-V4-*` fails to load | Unsupported in this SGLang build or insufficient VRAM | Check [SGLang docs](https://docs.sglang.io/) for model support; try `DeepSeek-V4-Flash` before Pro; lower `--mem-fraction-static` and `--context-length` | > [!NOTE] -> On DGX Station, the GB300 is typically device 1 (device 0 is the RTX Pro 6000 workstation GPU). Always verify with `nvidia-smi --query-gpu=index,name --format=csv,noheader`. +> On DGX Station the GB300 may be at device `0` or `1` depending on configuration (some Stations also expose a workstation GPU at `0`). Always verify with `nvidia-smi --query-gpu=index,name --format=csv,noheader` before launching the container. 
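
The device-selection rule in the note above can be sketched in Python. This is an illustrative helper, not part of the playbook: the function names are hypothetical, the sample listings are made up, and it assumes each row printed by `nvidia-smi --query-gpu=index,name --format=csv,noheader` has the form `index, name`.

```python
from typing import List, Tuple


def parse_gpu_list(csv_text: str) -> List[Tuple[int, str]]:
    """Parse 'index, name' rows into (index, name) tuples, skipping blank lines."""
    gpus = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        index, _, name = line.partition(",")
        gpus.append((int(index.strip()), name.strip()))
    return gpus


def docker_gpus_arg(csv_text: str, wanted: str = "GB300") -> str:
    """Return the value to pass to `docker run --gpus`, as typed in a shell."""
    gpus = parse_gpu_list(csv_text)
    matches = [index for index, name in gpus if wanted in name]
    if not matches:
        raise ValueError(f"no GPU whose name contains {wanted!r}")
    if len(gpus) == 1:
        # Single-GPU Station: `--gpus all` is sufficient.
        return "all"
    # Multi-GPU Station: pin the container to the first matching device.
    return "'\"device=%d\"'" % matches[0]


# Made-up listings for illustration only.
single_gpu = "0, NVIDIA GB300\n"
dual_gpu = "0, NVIDIA RTX PRO 6000\n1, NVIDIA GB300\n"

print(docker_gpus_arg(single_gpu))  # all
print(docker_gpus_arg(dual_gpu))    # '"device=1"'
```

On a real host the input text could come from `subprocess.check_output(["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"], text=True)`. The sketch keeps the parsing separate so it can be checked without a GPU present.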
diff --git a/nvidia/station-sglang-inference/endpoint-production.yaml b/nvidia/station-sglang-inference/endpoint-production.yaml new file mode 100644 index 0000000..4d35b66 --- /dev/null +++ b/nvidia/station-sglang-inference/endpoint-production.yaml @@ -0,0 +1,364 @@ +kind: Playbook +metadata: + name: station-sglang-inference + displayName: LLM Inference with SGLang + shortDescription: Serve LLMs with SGLang on DGX Station for prefix-cached multi-turn and structured output inference + + publisher: nvidia + description: | + # REPLACE THIS WITH YOUR MODEL CARD + https://gitlab-master.nvidia.com/api-catalog/examples/-/blob/main/modelcard-example-mixtral8x7b.md?ref_type=heads + + labelsV2: + - gpuType:playbook:gpu_type_station + - DGX Station + - GB300 + - Inference + - SGLang + - Blackwell + - Structured Output + - RadixAttention + + attributes: + - key: DURATION + value: 25 MIN + +spec: + artifactName: station-sglang-inference + nvcfFunctionId: None + attributes: + + showUnavailableBanner: false + apiDocsUrl: None + termsOfUse: | + + cta: + text: View on GitHub + url: https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/station-sglang-inference/ + + + tabs: + - + id: overview + + label: Overview + content: | + # Basic idea + + SGLang is a high-performance serving framework for large language models, optimized for workloads where requests share common prefixes — multi-turn conversations, RAG pipelines, and agentic workflows. Its core innovation, **RadixAttention**, automatically caches and reuses KV cache entries across requests using a radix tree, eliminating redundant prefill computation. SGLang also provides best-in-class **structured output generation** (JSON, regex, grammar-constrained decoding) through its xGrammar backend, running up to 3x faster than standard guided decoding. + + - **RadixAttention** — Automatically reuses KV cache across requests sharing common prefixes. 
Multi-turn conversations and repeated system prompts skip re-computation entirely, reducing first-token latency and increasing throughput. + - **Structured output** — Compressed finite-state machine decoding with grammar mask generation overlapped with the LLM forward pass. Produces valid JSON, regex-matched, or grammar-constrained output with minimal overhead. + - **OpenAI-compatible API** — Drop-in replacement for OpenAI and vLLM endpoints. Supports `/v1/chat/completions`, `/v1/completions`, and `/v1/embeddings`. + - **Blackwell optimized** — SGLang includes optimizations for SM100+ GPUs and CUDA 13, with NVFP4 GEMM support and accelerated softmax kernels. + + # What you'll accomplish + + Launch SGLang on DGX Station to serve an LLM, then exercise its two key differentiators: prefix-cached multi-turn chat and structured JSON output generation. You will also benchmark multi-turn throughput to see RadixAttention's effect. + + - Serve Qwen3-8B with SGLang's Blackwell-optimized backend + - Send multi-turn conversations and observe prefix cache hits in server metrics + - Generate structured JSON output using schema-constrained decoding + - Benchmark multi-turn throughput with and without prefix caching + + # What to know before starting + + - Basic Docker container usage + - Familiarity with REST APIs (curl or Python requests) + + # Prerequisites + + - NVIDIA DGX Station with GB300 GPU (Blackwell SM103) + - Docker installed: `docker --version` + - NVIDIA Container Toolkit configured: `nvidia-smi` should show the GB300 + - HuggingFace account with access token + - Network access to HuggingFace and Docker Hub + + # Ancillary files + + - `assets/benchmark_multiturn.py` — Python script that benchmarks multi-turn conversation throughput and demonstrates structured output generation + + # Time & risk + + * **Duration:** 20–25 minutes (including model download) + * **Risks:** Model download requires HuggingFace authentication + * **Rollback:** Stop and remove the container 
to restore state + * **Last Updated:** 04/06/2026 + * First Publication + + + + - + id: instructions + + label: Instructions + content: | + # Step 1. Set up Docker permissions + + If you haven't already, add your user to the docker group to run Docker without sudo: + + ```bash + sudo usermod -aG docker $USER + newgrp docker + ``` + + # Step 2. Set up environment variables + + ```bash + # HuggingFace token (required) + # Get a token from https://huggingface.co/settings/tokens + export HF_TOKEN="your_huggingface_token" + + # Model to serve + export MODEL_HANDLE="Qwen/Qwen3-8B" + + # Maximum context length + export MAX_MODEL_LEN=8192 + ``` + + # Step 3. Pull the SGLang container + + Pull the SGLang container image with CUDA 13.0 support (required for Blackwell SM103): + + ```bash + docker pull lmsysorg/sglang:latest-cu130 + ``` + + # Step 4. Identify the GB300 GPU + + On DGX Station with multiple GPUs, identify the GB300's device index: + + ```bash + nvidia-smi --query-gpu=index,name --format=csv,noheader + ``` + + Look for the row showing `NVIDIA GB300`. Note its index (e.g., `1`). + + # Step 5. Start SGLang server + + Launch the SGLang server: + + ```bash + # Replace device=1 with your GB300's index from Step 4 + docker run -d \ + --name sglang-server \ + --gpus '"device=1"' \ + --ipc host \ + --ulimit memlock=-1 \ + --ulimit stack=67108864 \ + -p 30000:30000 \ + -e HF_TOKEN="$HF_TOKEN" \ + -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \ + lmsysorg/sglang:latest-cu130 \ + sglang serve --model-path "$MODEL_HANDLE" \ + --host 0.0.0.0 \ + --port 30000 \ + --context-length $MAX_MODEL_LEN \ + --mem-fraction-static 0.85 + ``` + + Check the server logs: + + ```bash + docker logs -f sglang-server + ``` + + Wait for the server to show it is ready: + + ``` + INFO: Uvicorn running on http://0.0.0.0:30000 + ``` + + Press `Ctrl+C` to exit the log view. + + > [!NOTE] + > First launch downloads the model and compiles kernels. 
Subsequent starts are faster thanks to cached weights and compiled artifacts. + + # Step 6. Test basic inference + + Send a chat completion request using the OpenAI-compatible API: + + ```bash + curl http://localhost:30000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "'"$MODEL_HANDLE"'", + "messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}], + "max_tokens": 256 + }' + ``` + + The response follows the standard OpenAI format with a `choices` array containing the model's answer. + + # Step 7. Multi-turn conversation with prefix caching + + SGLang's RadixAttention automatically caches the KV cache for processed tokens. When follow-up messages share the same conversation prefix, the cached entries are reused — skipping prefill for all previously seen tokens. + + Send a multi-turn conversation. The system prompt is deliberately long so the shared prefix exceeds SGLang's page size (64 tokens), which is the minimum unit for cache reuse: + + ```bash + # Turn 1 + curl -s http://localhost:30000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "'"$MODEL_HANDLE"'", + "messages": [ + {"role": "system", "content": "You are an expert physics tutor who explains concepts clearly and concisely. You use real-world analogies and everyday examples to make abstract ideas concrete. When answering, first state the key concept in one sentence, then give a short explanation with an example."}, + {"role": "user", "content": "What is the difference between speed and velocity?"} + ], + "max_tokens": 256 + }' | python3 -m json.tool + + # Turn 2 — extends the same conversation + curl -s http://localhost:30000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "'"$MODEL_HANDLE"'", + "messages": [ + {"role": "system", "content": "You are an expert physics tutor who explains concepts clearly and concisely. 
You use real-world analogies and everyday examples to make abstract ideas concrete. When answering, first state the key concept in one sentence, then give a short explanation with an example."}, + {"role": "user", "content": "What is the difference between speed and velocity?"}, + {"role": "assistant", "content": "Speed is a scalar quantity that measures how fast an object moves, while velocity is a vector quantity that includes both speed and direction. For example, a car driving at 60 km/h has a speed of 60 km/h regardless of where it is headed. But if that car is driving 60 km/h north, that is its velocity — change direction to south and the velocity changes even though the speed stays the same."}, + {"role": "user", "content": "Can you give me another example that shows why the distinction matters in real physics problems?"} + ], + "max_tokens": 256 + }' | python3 -m json.tool + ``` + + The second request reuses the KV cache from the shared prefix (system message + first user turn + assistant response), only computing attention for the new user message. This reduces first-token latency for follow-up turns. + + Check the cache hit rate in the server logs. SGLang logs each prefill batch with the number of cached tokens reused: + + ```bash + docker logs sglang-server 2>&1 | grep "cached-token" | tail -10 + ``` + + Look for `#cached-token` values greater than 0 on later turns — this confirms RadixAttention is reusing the KV cache from the shared prefix. + + # Step 8. Structured JSON output + + SGLang's constrained decoding guarantees valid JSON output matching a schema. This uses the xGrammar backend to overlap grammar mask generation with the model's forward pass, adding minimal latency. 
+ + Generate a structured response: + + ```bash + curl -s http://localhost:30000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "'"$MODEL_HANDLE"'", + "messages": [ + {"role": "user", "content": "List three programming languages with their primary use case and year created."} + ], + "max_tokens": 512, + "response_format": { + "type": "json_schema", + "json_schema": { + "name": "languages", + "schema": { + "type": "object", + "properties": { + "languages": { + "type": "array", + "items": { + "type": "object", + "properties": { + "name": {"type": "string"}, + "primary_use": {"type": "string"}, + "year_created": {"type": "integer"} + }, + "required": ["name", "primary_use", "year_created"] + } + } + }, + "required": ["languages"] + } + } + } + }' | python3 -m json.tool + ``` + + The response content is guaranteed to be valid JSON matching the provided schema. Parse the `choices[0].message.content` field — it will contain a well-formed JSON object. + + # Step 9. Benchmark multi-turn throughput + + Run the included benchmark script to measure how prefix caching improves multi-turn latency. The script is in the `assets/` directory of this playbook. + + ```bash + python3 -m venv .venv && source .venv/bin/activate + pip install requests + ``` + + ```bash + python3 assets/benchmark_multiturn.py \ + --base-url http://localhost:30000 \ + --model "$MODEL_HANDLE" \ + --num-conversations 20 \ + --turns-per-conversation 5 + ``` + + The script sends parallel multi-turn conversations and measures: + - **Per-turn latency** for turn 1 vs subsequent turns (shows prefix caching effect) + - **Total throughput** in tokens per second + - **Cache statistics** from server metrics + + You should see latency decrease for later turns in each conversation as the shared prefix grows, demonstrating RadixAttention's cache reuse. + + > [!TIP] + > If you downloaded this playbook as a zip, the `assets/` directory is already present. 
If you cloned the full repository, navigate to `nvidia/station-sglang-inference/` first. + + # Step 10. Cleanup + + Stop and remove the container: + + ```bash + docker stop sglang-server + docker rm sglang-server + ``` + + Optionally remove the image: + + ```bash + docker rmi lmsysorg/sglang:latest-cu130 + ``` + + + + - + id: troubleshooting + + label: Troubleshooting + content: | + # Common issues + + | Symptom | Cause | Fix | + |---------|-------|-----| + | "permission denied" when running docker | User not in docker group | Run `sudo usermod -aG docker $USER && newgrp docker` | + | Container fails to start with GPU error | NVIDIA Container Toolkit not configured | Run `nvidia-ctk runtime configure --runtime=docker` and restart Docker | + | `device >= 0 && device < num_gpus INTERNAL ASSERT FAILED` | Using `--gpus all` on a multi-GPU system | Use `--gpus '"device=N"'` to target the GB300 specifically (check index with `nvidia-smi`) | + | "Token is required" or 401 error | Missing HuggingFace token | Ensure `HF_TOKEN` is exported before running the docker command | + | Server exits with OOM error | Model too large for available GPU memory | Lower `--mem-fraction-static` (e.g., 0.7) or reduce `--context-length`. Check GPU memory with `nvidia-smi` | + | `json_schema` response_format returns error | SGLang version too old | Ensure you are using `lmsysorg/sglang:latest-cu130`. Older versions may not support `json_schema` format | + | Server starts but CUDA errors on inference | Wrong CUDA version for Blackwell | Use the `latest-cu130` image tag. SM103 requires CUDA 13.0+ | + | Model runs on wrong GPU | Default GPU selection | Use `--gpus '"device=N"'` to select the GB300 specifically | + | Slow first request after server start | Kernel JIT compilation | First request triggers kernel compilation. Subsequent requests are fast. 
Wait ~30 seconds | + | Connection refused on port 30000 | Server still loading model | Check `docker logs sglang-server` — wait for the Uvicorn startup message | + | `/server_info` shows no cache stats | Endpoint may differ across versions | Try `curl http://localhost:30000/v1/models` to verify the server is responsive. Cache metrics may be under `/metrics` (requires `--enable-metrics` server flag) | + + > [!NOTE] + > On DGX Station, the GB300 is typically device 1 (device 0 is the RTX Pro 6000 workstation GPU). Always verify with `nvidia-smi --query-gpu=index,name --format=csv,noheader`. + + + + + resources: + - name: SGLang (GitHub) + url: https://github.com/sgl-project/sglang + + + - name: SGLang Documentation + url: https://docs.sglang.io/ + + + - name: SGLang OpenAI API Reference + url: https://docs.sglang.io/basic_usage/openai_api_completions.html + + diff --git a/nvidia/station-sglang-inference/endpoint-test.yaml b/nvidia/station-sglang-inference/endpoint-test.yaml index 4d35b66..4d5ae71 100644 --- a/nvidia/station-sglang-inference/endpoint-test.yaml +++ b/nvidia/station-sglang-inference/endpoint-test.yaml @@ -2,7 +2,7 @@ kind: Playbook metadata: name: station-sglang-inference displayName: LLM Inference with SGLang - shortDescription: Serve LLMs with SGLang on DGX Station for prefix-cached multi-turn and structured output inference + shortDescription: Serve LLMs with SGLang on DGX Station (Qwen3-8B default; Qwen3.6 MoE optional)—prefix-cached multi-turn, structured output, benchmarks, and inference-server guidance publisher: nvidia description: | @@ -21,7 +21,7 @@ metadata: attributes: - key: DURATION - value: 25 MIN + value: 30 MIN spec: artifactName: station-sglang-inference @@ -54,12 +54,13 @@ spec: # What you'll accomplish - Launch SGLang on DGX Station to serve an LLM, then exercise its two key differentiators: prefix-cached multi-turn chat and structured JSON output generation. 
You will also benchmark multi-turn throughput to see RadixAttention's effect. + Launch SGLang on DGX Station to serve an LLM, then exercise its two key differentiators: prefix-cached multi-turn chat and structured JSON output generation. You will also benchmark multi-turn throughput and interpret results together with server logs (wall time alone is not a reliable cache signal under parallel load). - - Serve Qwen3-8B with SGLang's Blackwell-optimized backend - - Send multi-turn conversations and observe prefix cache hits in server metrics + - Serve **Qwen3-8B** (`Qwen/Qwen3-8B` by default for fast first-run validation) or another checkpoint from the in-playbook model table — including the larger **Qwen3.6 MoE** (`Qwen/Qwen3.6-35B-A3B`) once the workflow is verified + - Send multi-turn conversations and observe prefix cache hits in **Docker logs** (`#cached-token`) - Generate structured JSON output using schema-constrained decoding - - Benchmark multi-turn throughput with and without prefix caching + - Benchmark multi-turn throughput; optional **single-conversation** run to reduce contention; full cache/metrics scrape written to a **log file** for review + - Optional next step: large MoE such as **DeepSeek-V4** on Station when your SGLang build and VRAM allow # What to know before starting @@ -76,14 +77,14 @@ spec: # Ancillary files - - `assets/benchmark_multiturn.py` — Python script that benchmarks multi-turn conversation throughput and demonstrates structured output generation + - `assets/benchmark_multiturn.py` — Benchmarks multi-turn chat under parallel load, structured JSON output, and writes full `/server_info` + `/metrics` bodies to a detail log (terminal shows a short summary only) # Time & risk - * **Duration:** 20–25 minutes (including model download) - * **Risks:** Model download requires HuggingFace authentication + * **Duration:** 20–30 minutes for the default `Qwen/Qwen3-8B`; 45–60 minutes if you switch to `Qwen/Qwen3.6-35B-A3B` (download + Blackwell 
CUDA-graph capture) + * **Risks:** Gated models (e.g., Llama 3.3) require HuggingFace authentication and license acceptance * **Rollback:** Stop and remove the container to restore state - * **Last Updated:** 04/06/2026 + * **Last Updated:** 05/26/2026 * First Publication @@ -105,17 +106,52 @@ spec: # Step 2. Set up environment variables ```bash - # HuggingFace token (required) - # Get a token from https://huggingface.co/settings/tokens - export HF_TOKEN="your_huggingface_token" + # HuggingFace token (only required for gated models such as Llama 3.3). + # Leave empty for public models like Qwen3-8B; for gated models get a token at + # https://huggingface.co/settings/tokens. + export HF_TOKEN="" - # Model to serve + # Model to serve (see **Example model IDs** below). + # Default uses Qwen3-8B for fast first-run validation (~10–15 min boot on Station). + # Switch to Qwen3.6-35B-A3B once the workflow is working end-to-end. export MODEL_HANDLE="Qwen/Qwen3-8B" # Maximum context length export MAX_MODEL_LEN=8192 ``` + ## Example model IDs (`MODEL_HANDLE`) + + Use any **Hugging Face text-generation or chat** checkpoint that your SGLang build supports. The table below lists common starting points on DGX Station; always check the model card for **license / gated access**, **VRAM**, and **context length**. + + | Model ID | Notes | + |----------|--------| + | `Qwen/Qwen3-8B` | **Default in this playbook.** Dense Qwen3 8B; ~16 GB download, fast warmup, ideal for validating the workflow end-to-end. | + | `Qwen/Qwen3.6-35B-A3B` | Qwen3.6 MoE (~3B active experts); strong quality per GPU hour on Blackwell. ~70 GB download; allow ~30–45 min to first request. | + | `Qwen/Qwen3.6-27B` | Dense Qwen3.6; higher VRAM than the MoE row above at equal batch settings. | + | `google/gemma-3-12b-it` | Popular Gemma 3 instruct (text + vision in full stack; chat API usage is typically text-only). | + | `google/gemma-3-27b-it` | Larger Gemma 3 instruct variant. 
| + | `meta-llama/Llama-3.3-70B-Instruct` | Llama 3.3 70B instruct (gated on Hugging Face; accept the license in the model card before download). | + + Heavyweight MoE (very large weights; confirm **SGLang version + GPU memory** before serving): + + | Model ID | Notes | + |----------|--------| + | `deepseek-ai/DeepSeek-V4-Flash` | DeepSeek-V4 family (MoE). Intended to showcase **large local models on Station**; expect long downloads, strict VRAM headroom, and possible extra flags per SGLang docs. | + | `deepseek-ai/DeepSeek-V4-Pro` | Larger V4 variant; only if you have sufficient GPU memory and a supported SGLang build. | + + ## Choosing an inference backend (DGX Station) + + Several OpenAI-compatible servers run well on NVIDIA hardware. None is universally “best”; pick by workload shape and operational constraints. + + | Backend | Strengths | Typical “use this when…” | + |---------|-----------|---------------------------| + | **SGLang** | RadixAttention for shared-prefix workloads; strong structured / grammar decoding; active Blackwell + CUDA 13 paths. | Highly **multi-turn**, **RAG** (repeated system + documents), **agents**, or **schema-constrained** JSON at scale. | + | **vLLM** | Mature PagedAttention, broad model coverage, common default in examples. | You want a **well-trodden** OSS server with **maximum community recipes** and straightforward PagedAttention behavior. | + | **TensorRT-LLM** | NVIDIA-optimized kernels and quantization workflows for throughput-focused deployment. | You are **productionizing** on NVIDIA GPUs and can invest in **TensorRT-LLM export / engines** for peak throughput. | + + This playbook focuses on **SGLang**; consult each project’s documentation for model support matrices and quantization modes. + # Step 3. Pull the SGLang container Pull the SGLang container image with CUDA 13.0 support (required for Blackwell SM103): @@ -126,24 +162,26 @@ spec: # Step 4. 
Identify the GB300 GPU - On DGX Station with multiple GPUs, identify the GB300's device index: + Identify the GB300's device index: ```bash nvidia-smi --query-gpu=index,name --format=csv,noheader ``` - Look for the row showing `NVIDIA GB300`. Note its index (e.g., `1`). + Look for the row showing `NVIDIA GB300`. Note its index — on DGX Station the GB300 may be at index `0` or `1` depending on configuration. If `nvidia-smi` shows only a single GB300, you can simply use `--gpus all` in the next step. # Step 5. Start SGLang server - Launch the SGLang server: + Launch the SGLang server. The flags below are tuned for GB300 (Blackwell SM103) — see notes after the command: ```bash - # Replace device=1 with your GB300's index from Step 4 + # Use --gpus all on a single-GPU Station, or --gpus '"device=N"' with the + # index from Step 4 if multiple GPUs are present. docker run -d \ --name sglang-server \ - --gpus '"device=1"' \ + --gpus all \ --ipc host \ + --cap-add SYS_NICE \ --ulimit memlock=-1 \ --ulimit stack=67108864 \ -p 30000:30000 \ @@ -154,9 +192,17 @@ spec: --host 0.0.0.0 \ --port 30000 \ --context-length $MAX_MODEL_LEN \ - --mem-fraction-static 0.85 + --mem-fraction-static 0.85 \ + --attention-backend flashinfer \ + --enable-cache-report ``` + > [!IMPORTANT] + > **Why these flags on GB300:** + > - `--attention-backend flashinfer` — the auto-selected `trtllm_mha` backend currently fails CUDA-graph capture on Blackwell SM103 with `buildNdTmaDescriptor` errors; `fa3` is also rejected (it requires SM ≤ 90). FlashInfer is the safe default. + > - `--cap-add SYS_NICE` — lets SGLang set NUMA affinity; otherwise the server logs a warning on every launch. + > - `--enable-cache-report` — populates `usage.prompt_tokens_details.cached_tokens` in OpenAI-style responses so the benchmark in Step 9 can report cached prefill tokens. + Check the server logs: ```bash @@ -172,7 +218,7 @@ spec: Press `Ctrl+C` to exit the log view. 
> [!NOTE] - > First launch downloads the model and compiles kernels. Subsequent starts are faster thanks to cached weights and compiled artifacts. + > First launch downloads the model and captures CUDA graphs. Plan for ~10–15 min for `Qwen/Qwen3-8B` and ~30–45 min for `Qwen/Qwen3.6-35B-A3B` before the server is ready. Subsequent starts are faster thanks to cached weights and compiled artifacts. # Step 6. Test basic inference @@ -224,15 +270,15 @@ spec: }' | python3 -m json.tool ``` - The second request reuses the KV cache from the shared prefix (system message + first user turn + assistant response), only computing attention for the new user message. This reduces first-token latency for follow-up turns. + The second request **reuses the KV cache for the shared prefix** (system message + first user turn + assistant response) via RadixAttention, so repeated **prefill** work on that prefix is avoided. **End-to-end HTTP latency can still go up** on later turns: the transcript is longer (more tokens to attend to even with cache hits on the prefix), each assistant reply adds decode work, and the client measures full request time—not prefill alone. - Check the cache hit rate in the server logs. SGLang logs each prefill batch with the number of cached tokens reused: + Check cache reuse in the server logs. SGLang logs each prefill batch with the number of cached tokens reused: ```bash docker logs sglang-server 2>&1 | grep "cached-token" | tail -10 ``` - Look for `#cached-token` values greater than 0 on later turns — this confirms RadixAttention is reusing the KV cache from the shared prefix. + Look for `#cached-token` values greater than 0 on later turns — this confirms RadixAttention is reusing the KV cache from the shared prefix. Treat that as the primary signal of prefix caching; wall-clock `curl` latency alone can be misleading. # Step 8. Structured JSON output @@ -280,30 +326,72 @@ spec: # Step 9. 
Benchmark multi-turn throughput - Run the included benchmark script to measure how prefix caching improves multi-turn latency. The script is in the `assets/` directory of this playbook. + This step uses `benchmark_multiturn.py` from this playbook's `assets/` directory. Clone (or download) the playbook repository first so the script is available locally: ```bash + git clone https://github.com/NVIDIA/dgx-spark-playbooks + cd dgx-spark-playbooks/nvidia/station-sglang-inference + ``` + + > [!TIP] + > If `git` is not available, download the repository as a ZIP from the [playbook repository](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/station-sglang-inference/) and extract it. All commands below assume your working directory is the playbook root (`dgx-spark-playbooks/nvidia/station-sglang-inference/`), so `assets/benchmark_multiturn.py` resolves correctly. + + The benchmark stress-tests the server with **parallel** conversations (default: 20) and reports **per-turn wall time**, token counts, and (when the API exposes it) **cached prefill tokens**. + + Install the `requests` dependency. The **virtualenv approach below is the preferred, default installation path**: it keeps the script's dependencies isolated from the system Python interpreter so you cannot accidentally damage Ubuntu's own Python packages. 
Ubuntu 24.04 on DGX Station does not ship `python3-venv` by default, so install it once before creating the virtualenv: + + ```bash + sudo apt update && sudo apt install -y python3-venv python3 -m venv .venv && source .venv/bin/activate pip install requests ``` + If you cannot run `sudo apt install python3-venv` (for example, a locked-down host), the next safest option is a **per-user install** that still respects PEP 668: + + ```bash + python3 -m pip install --user requests + ``` + + > [!CAUTION] + > **Last-resort only — `--break-system-packages` can damage your system Python.** + > Ubuntu 24.04 ships an "externally managed" system Python (PEP 668). The `--break-system-packages` flag tells `pip` to ignore that guard and install into the system or per-user site-packages anyway. This can shadow or conflict with packages installed by `apt` and break system tooling that depends on them. Only use this command when both the **venv** and plain **`--user`** paths above are unavailable, and only if you are willing to take that risk on the host you are running on: + > ```bash + > python3 -m pip install --user --break-system-packages requests + > ``` + ```bash python3 assets/benchmark_multiturn.py \ --base-url http://localhost:30000 \ --model "$MODEL_HANDLE" \ --num-conversations 20 \ + --turns-per-conversation 5 \ + --cache-detail-file ./sglang_benchmark_cache_details.log + ``` + + The script prints: + - **Median / P90 wall time per turn** — often **increases** as prompts grow and under parallel load; that does **not** contradict RadixAttention. + - **Median prompt tokens per turn** — should climb as history lengthens. + - **Median cached prefill tokens** (when returned in `usage`) — populated by `--enable-cache-report` (already set in Step 5); this is the primary cache signal from the OpenAI-style `usage` payload. 
+ - A **short summary** of cache-related `/server_info` or `/metrics` lines; the **full** responses are written to `--cache-detail-file` (default `./sglang_benchmark_cache_details.log`) so you are not flooded with an unparsed metrics blob in the terminal. + + > [!NOTE] + > The Step 5 launch enables `--enable-cache-report` (which fills `usage.prompt_tokens_details.cached_tokens`) but does **not** enable the Prometheus `/metrics` endpoint, since cached-prefill data is already exposed through `usage` and the `docker logs` `#cached-token` lines. If `/metrics` returns `404`/empty in the detail log, that is expected — the benchmark's primary cache signals (`usage.prompt_tokens_details.cached_tokens` and Docker logs) still work. To populate `/metrics` as well, add `--enable-metrics` to the `sglang serve` invocation in Step 5 and restart the container. + + To isolate prefix-cache behavior from multi-client contention, rerun with a single conversation: + + ```bash + python3 assets/benchmark_multiturn.py \ + --base-url http://localhost:30000 \ + --model "$MODEL_HANDLE" \ + --num-conversations 1 \ --turns-per-conversation 5 ``` - The script sends parallel multi-turn conversations and measures: - - **Per-turn latency** for turn 1 vs subsequent turns (shows prefix caching effect) - - **Total throughput** in tokens per second - - **Cache statistics** from server metrics + Always correlate behavior with **`docker logs`** (`#cached-token` lines) as in Step 7. - You should see latency decrease for later turns in each conversation as the shared prefix grows, demonstrating RadixAttention's cache reuse. + ## Next steps: heavier models on Station - > [!TIP] - > If you downloaded this playbook as a zip, the `assets/` directory is already present. If you cloned the full repository, navigate to `nvidia/station-sglang-inference/` first. 
+ To stress GPU memory and throughput after completing the steps above, point `MODEL_HANDLE` at a larger checkpoint (for example `deepseek-ai/DeepSeek-V4-Flash`), **lower** `--mem-fraction-static` if you hit OOM, and reduce `--context-length` until the server starts cleanly. Confirm your SGLang image version supports the architecture (see [SGLang documentation](https://docs.sglang.io/)) and accept any **gated** model licenses on Hugging Face before pulling weights. # Step 10. Cleanup @@ -333,18 +421,24 @@ spec: |---------|-------|-----| | "permission denied" when running docker | User not in docker group | Run `sudo usermod -aG docker $USER && newgrp docker` | | Container fails to start with GPU error | NVIDIA Container Toolkit not configured | Run `nvidia-ctk runtime configure --runtime=docker` and restart Docker | - | `device >= 0 && device < num_gpus INTERNAL ASSERT FAILED` | Using `--gpus all` on a multi-GPU system | Use `--gpus '"device=N"'` to target the GB300 specifically (check index with `nvidia-smi`) | - | "Token is required" or 401 error | Missing HuggingFace token | Ensure `HF_TOKEN` is exported before running the docker command | + | `device >= 0 && device < num_gpus INTERNAL ASSERT FAILED` | `--gpus '"device=N"'` index does not exist on this Station | Re-run `nvidia-smi --query-gpu=index,name --format=csv,noheader` and use the actual GB300 index, or `--gpus all` if there is only one GPU | + | `RuntimeError: ... buildNdTmaDescriptor ... Check failed: false` during CUDA-graph capture | Default `trtllm_mha` attention backend is incompatible with Blackwell SM103 | Pass `--attention-backend flashinfer` to `sglang serve` | + | `AssertionError: FlashAttention v3 Backend requires SM>=80 and SM<=90` | `--attention-backend fa3` selected on Blackwell (SM103) | Use `--attention-backend flashinfer` instead | + | `User lacks permission to set NUMA affinity ... 
try adding --cap-add SYS_NICE` warning | Docker dropped the `SYS_NICE` capability | Add `--cap-add SYS_NICE` to the `docker run` command | + | `python3 -m venv .venv` fails with `apt install python3.12-venv` hint | Ubuntu 24.04 ships without `python3-venv` | Run `sudo apt update && sudo apt install -y python3-venv` (or use `python3 -m pip install --user --break-system-packages requests`) | + | "Token is required" or 401 error | Missing HuggingFace token for a gated model | Export `HF_TOKEN` before running the docker command and accept the model license on huggingface.co | | Server exits with OOM error | Model too large for available GPU memory | Lower `--mem-fraction-static` (e.g., 0.7) or reduce `--context-length`. Check GPU memory with `nvidia-smi` | | `json_schema` response_format returns error | SGLang version too old | Ensure you are using `lmsysorg/sglang:latest-cu130`. Older versions may not support `json_schema` format | | Server starts but CUDA errors on inference | Wrong CUDA version for Blackwell | Use the `latest-cu130` image tag. SM103 requires CUDA 13.0+ | - | Model runs on wrong GPU | Default GPU selection | Use `--gpus '"device=N"'` to select the GB300 specifically | - | Slow first request after server start | Kernel JIT compilation | First request triggers kernel compilation. Subsequent requests are fast. Wait ~30 seconds | - | Connection refused on port 30000 | Server still loading model | Check `docker logs sglang-server` — wait for the Uvicorn startup message | - | `/server_info` shows no cache stats | Endpoint may differ across versions | Try `curl http://localhost:30000/v1/models` to verify the server is responsive. Cache metrics may be under `/metrics` (requires `--enable-metrics` server flag) | + | Slow first request after server start | Kernel JIT + CUDA-graph capture | First launch can take 10–15 min for `Qwen/Qwen3-8B` and 30–45 min for `Qwen/Qwen3.6-35B-A3B` before the server prints "fired up and ready to roll!". 
Subsequent requests are fast. | + | Connection refused on port 30000 | Server still loading model or capturing CUDA graphs | Check `docker logs sglang-server` — wait for the Uvicorn startup message and "The server is fired up and ready to roll!" | + | `Med cached prefill` column is `n/a` in the benchmark | OpenAI-style `cached_tokens` not enabled on the server | Add `--enable-cache-report` to `sglang serve` so `usage.prompt_tokens_details.cached_tokens` is populated | + | `/server_info` body floods the benchmark "cache highlights" output | Older `benchmark_multiturn.py` matched any line containing "cache" — including the single-line `/server_info` JSON | Use the version of `benchmark_multiturn.py` shipped with this playbook (it skips JSON blobs and lines longer than 200 chars); the full body is still saved to `--cache-detail-file` | + | Benchmark shows **higher** median latency on later turns | Expected under parallel load + longer transcripts | RadixAttention reduces repeated **prefill** on shared prefixes—use `docker logs` (`#cached-token`) and optionally `--num-conversations 1`. See Step 9 and `sglang_benchmark_cache_details.log` | + | `deepseek-ai/DeepSeek-V4-*` fails to load | Unsupported in this SGLang build or insufficient VRAM | Check [SGLang docs](https://docs.sglang.io/) for model support; try `DeepSeek-V4-Flash` before Pro; lower `--mem-fraction-static` and `--context-length` | > [!NOTE] - > On DGX Station, the GB300 is typically device 1 (device 0 is the RTX Pro 6000 workstation GPU). Always verify with `nvidia-smi --query-gpu=index,name --format=csv,noheader`. + > On DGX Station the GB300 may be at device `0` or `1` depending on configuration (some Stations also expose a workstation GPU at `0`). Always verify with `nvidia-smi --query-gpu=index,name --format=csv,noheader` before launching the container. 
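
The index check recommended in the note above can be sketched in Python. This is a minimal illustration and not part of the playbook's assets: `parse_gpu_rows`, `gpus_argument`, and `query_gpu_rows` are hypothetical helper names, and the parser assumes every output row has the form `index, name`, as requested by the `nvidia-smi` query used in Step 4.

```python
import subprocess

# Same query as Step 4; the CSV rows are assumed to look like "1, NVIDIA GB300".
QUERY = ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"]


def parse_gpu_rows(csv_text):
    """Turn 'index, name' rows into a list of (index, name) tuples."""
    rows = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        index, name = line.split(",", 1)
        rows.append((int(index.strip()), name.strip()))
    return rows


def gpus_argument(rows, wanted="GB300"):
    """Choose the value for `docker run --gpus` from the parsed rows.

    Returns "all" when the only GPU present matches `wanted`; otherwise
    returns the quoted device selector for the first matching GPU.
    """
    matches = [index for index, name in rows if wanted in name]
    if not matches:
        raise ValueError(f"no GPU whose name contains {wanted!r}")
    if len(rows) == 1:
        return "all"
    # Docker expects the inner double quotes, hence '"device=N"' in the shell.
    return f'"device={matches[0]}"'


def query_gpu_rows():
    """Run nvidia-smi on the host and parse its output (requires the driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_rows(result.stdout)
```

On a Station, `gpus_argument(query_gpu_rows())` would return the string to place after `--gpus`. When pasting a `"device=N"` result into a shell command, wrap it in single quotes so the inner double quotes survive, as in Step 5.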
diff --git a/nvidia/station-topic-modeling/endpoint-production.yaml b/nvidia/station-topic-modeling/endpoint-production.yaml index 40d17aa..884d04e 100644 --- a/nvidia/station-topic-modeling/endpoint-production.yaml +++ b/nvidia/station-topic-modeling/endpoint-production.yaml @@ -32,7 +32,8 @@ spec: cta: text: View on GitHub - url: https://github.com/NVIDIA/dgx-station-playbooks/blob/main/nvidia/station-topic-modeling/ + url: https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/station-topic-modeling/ + tabs: @@ -196,8 +197,8 @@ spec: Clone the playbook repository and download the Amazon Electronics Reviews dataset. ```bash - git clone https://github.com/NVIDIA/dgx-station-playbooks - cd dgx-station-playbooks/nvidia/station-topic-modeling/assets + git clone https://github.com/NVIDIA/dgx-spark-playbooks + cd dgx-spark-playbooks/nvidia/station-topic-modeling/assets ``` Download the dataset (~14GB compressed): @@ -272,8 +273,8 @@ spec: # Optional: remove Hugging Face cache (embedding cache from the notebook) rm -rf ~/.cache/huggingface - # From the parent of dgx-station-playbooks/, remove the cloned repo - rm -rf dgx-station-playbooks/ + # From the parent of dgx-spark-playbooks/, remove the cloned repo + rm -rf dgx-spark-playbooks/ ``` # Next steps diff --git a/nvidia/station-txt2kg/endpoint-production.yaml b/nvidia/station-txt2kg/endpoint-production.yaml index c6126a4..76413f8 100644 --- a/nvidia/station-txt2kg/endpoint-production.yaml +++ b/nvidia/station-txt2kg/endpoint-production.yaml @@ -35,7 +35,7 @@ spec: cta: text: View on GitHub - url: https://github.com/NVIDIA/dgx-station-playbooks/blob/main/nvidia/station-txt2kg/ + url: https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/station-txt2kg/ tabs: @@ -115,8 +115,8 @@ spec: This playbook is for **DGX Station**. In a terminal, clone the repository and navigate to the project directory. 
```bash - git clone https://github.com/NVIDIA/dgx-station-playbooks - cd dgx-station-playbooks/nvidia/station-txt2kg/assets + git clone https://github.com/NVIDIA/dgx-spark-playbooks + cd dgx-spark-playbooks/nvidia/station-txt2kg/assets ``` # Step 2. Start the txt2kg services