dgx-spark-playbooks/nvidia/station-gr00t/endpoint-test.yaml
2026-05-27 16:00:20 +00:00


kind: Playbook
metadata:
name: station-gr00t
displayName: Isaac GR00T N1.6 Fine-Tuning
shortDescription: Fine-tune and benchmark NVIDIA's GR00T N1.6 robotics foundation model on DGX Station
publisher: nvidia
description: |
# REPLACE THIS WITH YOUR MODEL CARD
https://gitlab-master.nvidia.com/api-catalog/examples/-/blob/main/modelcard-example-mixtral8x7b.md?ref_type=heads
labelsV2:
- gpuType:playbook:gpu_type_station
- DGX Station
- GB300
- Robotics
- Isaac GR00T
- Fine-Tuning
- Blackwell
- VLA
attributes:
- key: DURATION
value: 45 MIN
spec:
artifactName: station-gr00t
nvcfFunctionId: None
attributes:
showUnavailableBanner: false
apiDocsUrl: None
termsOfUse: |
cta:
text: View on GitHub
url: https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/station-gr00t/
tabs:
-
id: overview
label: Overview
content: |
# Basic idea
NVIDIA Isaac GR00T N1.6 is a 3-billion-parameter open vision-language-action (VLA) foundation model for generalist humanoid robot skills. It combines a Cosmos-family vision-language backbone with a 32-layer Diffusion Transformer (DiT) action head that denoises continuous robot actions from multimodal input — language instructions and camera images. The model is pre-trained on a large mixture of robot demonstration data, then adapted to specific embodiments and tasks through fine-tuning.
High-level architecture (VLM + DiT action head), as in the upstream Isaac GR00T repo:
![GR00T N1.6 reference architecture](./assets/GR00T-reference-arch-diagram.png)
*Source: [NVIDIA Isaac GR00T — `media/GR00T-reference-arch-diagram.png`](https://github.com/NVIDIA/Isaac-GR00T/blob/n1.6-release/media/GR00T-reference-arch-diagram.png). If the local image above is missing, the upstream copy is at `https://raw.githubusercontent.com/NVIDIA/Isaac-GR00T/n1.6-release/media/GR00T-reference-arch-diagram.png`.*
In this playbook you will fine-tune GR00T N1.6 on the **LIBERO Spatial** benchmark on a **DGX Station** with **GB300** (large unified memory). That setup supports a high **global batch size (128)** on a single GPU, which improves training throughput compared to typical 24-80 GB consumer or datacenter GPUs.
# LIBERO Spatial (what you are fine-tuning on)
**LIBERO Spatial** is part of the [LIBERO](https://libero-project.github.io/main.html) suite of simulated tabletop manipulation benchmarks. The **spatial** split emphasizes **where** objects need to be placed: tasks such as putting a bowl on a **stove burner** vs a **plate**, placing utensils in a **mug** vs next to it, or moving objects to **left/right/front** targets on the table. Episodes include third-person RGB video, proprioceptive state, language instructions, and continuous end-effector actions in a consistent LeRobot v2 layout. Understanding these constraints helps when you read training logs or open-loop evaluation plots.
# What kind of fine-tuning this playbook uses
This playbook runs the **default Isaac GR00T fine-tuning recipe** from `launch_finetune.py`, which does **not** update every weight of the 3B-parameter model. In the stock configuration, training focuses on the **action head (DiT)** and **projector / adapter paths** that map observations into the action model, with strong **state dropout** and **color jitter** so the policy leans on vision. Optional flags such as `--tune-llm` or `--tune-visual` (mentioned under Next steps) trade compute and memory for updating more of the backbone. **LoRA** is not the default here; if your team uses LoRA or other PEFT variants, treat that as a separate configuration branch from this playbook.
# NVIDIA DGX Station (why this hardware)
**DGX Station** is a deskside AI system built for **large-memory GPU** training and inference (this playbook targets **GB300** with **284 GB HBM3e**). Beyond robotics, the same class of machine supports **large-model fine-tuning**, **RAG serving**, **multi-modal training**, and **CUDA research** where single-GPU memory and bandwidth dominate. For GR00T, the headline benefit is fitting **much larger batch sizes** per GPU than on smaller cards, which stabilizes gradients and improves **samples per second** when the data pipeline keeps up.
# What you'll accomplish
- Check out the **`n1.6-release`** branch of Isaac GR00T so commands, embodiment tags, and `demo_data/` match GR00T **N1.6**
- Set up the environment with `uv` (project-local `.venv`) and understand what the optional `install_deps.sh` script changes on the system
- Apply the recommended **PyAV `get_frames_by_indices` patch** when `torchcodec` is unavailable so LIBERO **AV1** video decoding does not stall on an **ffmpeg** subprocess fallback
- Verify the base model, fine-tune on LIBERO Spatial at batch size **128**, run open-loop evaluation, and measure inference latency (with **GB300 / Blackwell** TorchDynamo compilation notes)
# What to know before starting
- Familiarity with Python virtual environments (`source .venv/bin/activate`)
- Familiarity with PyTorch training concepts (batch size, loss, checkpoints)
- Basic robot manipulation vocabulary (trajectories, observations, actions)
- Comfort running commands that may use **`sudo`** for system packages (or use the documented user-space alternative)
# Prerequisites
- NVIDIA **DGX Station** with **GB300** (Blackwell SM103, 284 GB HBM3e)
- CUDA toolkit usable by PyTorch: `nvcc --version` should show **CUDA 12.8+** (often already under `/usr/local/cuda` on DGX images)
- **Git** and **Git LFS** (`git lfs version`) — LFS is required for some demo assets and submodules; install with `sudo apt-get install -y git-lfs` then `git lfs install` if missing
- Hugging Face account and **HF_TOKEN** for model and dataset downloads
- Network access to Hugging Face, GitHub, and PyPI
- At least **~30 GB** free disk for `.venv`, checkpoints, and the LIBERO download
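The checklist above can be turned into a quick preflight check. The sketch below is illustrative only: `check_tools` and `has_free_gb` are hypothetical helper names, not part of Isaac GR00T, and it assumes Git LFS is installed as a `git-lfs` binary on `PATH`.

```bash
# Hypothetical helper: report whether each named tool is on PATH.
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Hypothetical helper: succeed if the filesystem holding DIR has at least MIN_GB free.
has_free_gb() {
  local dir=$1 min_gb=$2 avail_kb
  avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  [ "$avail_kb" -ge "$((min_gb * 1024 * 1024))" ]
}

check_tools git git-lfs nvcc || echo "Install the missing tools before continuing."
has_free_gb . 30 || echo "Less than 30 GB free in $(pwd)."
```

Run it from the directory where you plan to clone the repository, since the disk check looks at the filesystem behind the current directory.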
# Time & risk
* **Duration:** ~45 minutes end-to-end when the video backend is healthy (setup, downloads, ~20-25 min training at 2000 steps, eval and inference)
* **Risks:** `scripts/deployment/dgpu/install_deps.sh` performs **system-level** `apt` operations and may install the **CUDA 12.8 toolkit** if `/usr/local/cuda` is absent (see Instructions). Model download requires Hugging Face authentication.
* **Rollback:** Remove the cloned `Isaac-GR00T` directory and optionally `rm -rf ~/.local/share/uv` if you want to reclaim `uv` caches. Reverting `apt`-installed packages is a separate admin task; the playbook does not uninstall them automatically.
* **Last Updated:** 05/26/2026
* First Publication
-
id: instructions
label: Instructions
content: |
# Step 1. Clone Isaac GR00T and install dependencies
## 1a. Git LFS (required for a clean clone)
If `git clone` fails with errors about **Git LFS** or missing pointer files, install and initialize LFS, then remove any partial `Isaac-GR00T` directory and clone again:
```bash
sudo apt-get update
sudo apt-get install -y git-lfs
git lfs install
```
## 1b. Clone and check out `n1.6-release`
The **`main`** branch tracks ongoing development (for example newer GR00T milestones) and **does not** always match this **N1.6** playbook. Embodiment tags such as **`GR1`**, paths like **`demo_data/gr1.PickNPlace`**, and tutorial scripts are aligned with the **`n1.6-release`** branch.
```bash
git clone --recurse-submodules https://github.com/NVIDIA/Isaac-GR00T
cd Isaac-GR00T
git fetch origin
git checkout n1.6-release
git submodule update --init --recursive
```
## 1c. Install Python dependencies
### Option A — `install_deps.sh` (matches upstream docs; uses `sudo`)
This script is the supported path. It may make **system-level** changes:
- Runs `apt-get update` and installs **`ffmpeg`** and **`libaio-dev`**
- If **`/usr/local/cuda`** is missing, adds the NVIDIA CUDA apt repository and installs **`cuda-toolkit-12-8`**
- Installs **`uv`** into your user account if needed, then runs **`uv sync`** and **`uv pip install -e .`** into the project **`.venv`**
- On **aarch64** only: installs FFmpeg **development** packages and **builds `torchcodec` from source** into `.venv`
```bash
I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 bash scripts/deployment/dgpu/install_deps.sh
```
### Option B — User-space only (no `install_deps.sh`)
Use this only when **CUDA 12.8+** is already installed, system **`ffmpeg`** / **`libaio-dev`** are already present, and your policy forbids the script's `apt` or CUDA steps. From the **Isaac-GR00T** repo root, install **`uv`** if needed, then:
```bash
command -v uv >/dev/null || curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="/usr/local/cuda/bin:$HOME/.local/bin:$PATH"
export CUDA_HOME=/usr/local/cuda
uv sync
uv pip install -e .
```
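After the exports above, it is worth confirming which `nvcc` is first on `PATH` and whether it meets the CUDA 12.8+ prerequisite. The following is a minimal sketch: `nvcc_release` and `version_ge` are hypothetical helpers, and the parsing assumes `nvcc --version` prints a line containing `release <major>.<minor>`.

```bash
# Hypothetical helper: read `nvcc --version` text on stdin, print the release (e.g. 12.8).
nvcc_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: succeed if version $1 >= version $2 (major.minor only).
version_ge() {
  local a_major=${1%%.*} a_minor=${1#*.} b_major=${2%%.*} b_minor=${2#*.}
  [ "$a_major" -gt "$b_major" ] ||
    { [ "$a_major" -eq "$b_major" ] && [ "$a_minor" -ge "$b_minor" ]; }
}

if command -v nvcc >/dev/null 2>&1; then
  release=$(nvcc --version | nvcc_release)
  echo "first nvcc on PATH: $(command -v nvcc) (release $release)"
  version_ge "$release" 12.8 || echo "CUDA $release is older than 12.8; check PATH order."
fi
```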
You still need a working **video backend** for LIBERO (see Step 2). On aarch64, building **torchcodec** inside `.venv` without the script is possible but manual; see Troubleshooting.
> [!IMPORTANT]
> **`PATH` and `CUDA_HOME` matter on multi-toolkit hosts.** If the system has both an old Ubuntu `nvidia-cuda-toolkit` package (`/usr/bin/nvcc` ≈ 12.0) and a current NVIDIA CUDA repo install (`/usr/local/cuda-13.x/bin/nvcc`), `uv` will pick whichever appears first on `PATH`. Putting `/usr/local/cuda/bin` first (and exporting `CUDA_HOME`) is required for `flash-attn`'s source build to find the matching toolkit. Verify with `nvcc --version` after the export.
> [!WARNING]
> **`flash-attn` build on aarch64 takes ~2 hours from source.** The upstream `pyproject.toml` only lists pre-built `flash-attn==2.7.4.post1` wheels for **`x86_64`**; on aarch64 (Grace + GB300), `uv sync` falls back to compiling ~72 CUDA kernels from source. A faster route is to pin `flash-attn==2.8.1` and reuse the GitHub release's prebuilt aarch64 wheel:
>
> ```toml
> # In pyproject.toml under [project] dependencies:
> "flash-attn==2.8.1",
>
> # In [tool.uv.sources]:
> flash-attn = [
> { url = "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.1/flash_attn-2.8.1+cu12torch2.10cxx11abiTRUE-cp312-cp312-linux_aarch64.whl",
> marker = "sys_platform == 'linux' and platform_machine == 'aarch64' and python_version == '3.12'" },
> ]
> ```
>
> With this pin, `uv sync` finishes in ~1 minute on aarch64 instead of ~2 hours. The wheel works against torch 2.10. Verified on GB300 + CUDA 13.1 in this playbook's validation run.
Activate the virtual environment:
```bash
source .venv/bin/activate
```
Verify GPU access:
```bash
CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_name(0))"
```
Expected output: `NVIDIA GB300`
> [!NOTE]
> Examples in this playbook use **`CUDA_VISIBLE_DEVICES=0`** because the GB300 is at index `0` on a single-GPU Station. On a multi-GPU Station (for example RTX PRO 6000 + GB300), the GB300 may be at a different index — run `nvidia-smi --query-gpu=index,name --format=csv,noheader`, find the GB300 row, and substitute that index everywhere `CUDA_VISIBLE_DEVICES=0` appears below.
# Step 2. PyAV patch for LIBERO video (strongly recommended)
On many stacks **`torchcodec`** fails to import or build, so the resolver falls back to **`pyav`** (the fallback order is already `torchcodec` → `decord` → `pyav` → `ffmpeg`). Stock **`n1.6-release`** can then raise **`NotImplementedError`** from `get_frames_by_indices`, which has no **`pyav`** branch. Without this patch, training may **appear hung**: the GPU sits idle with no traceback while **ffmpeg** spawns per-frame decode work on the CPU.
From the **Isaac-GR00T repo root** with **`n1.6-release`** checked out and **`.venv` activated**:
```bash
git apply /path/to/dgx-spark-playbooks/nvidia/station-gr00t/assets/patches/001-pyav-get-frames-by-indices.patch
uv pip install av
```
If you copied `nvidia/station-gr00t/assets/patches/` into the Isaac-GR00T root instead, use `git apply assets/patches/001-pyav-get-frames-by-indices.patch`.
Details and re-apply rules: `nvidia/station-gr00t/assets/patches/README.md`.
After patching, repeated log lines such as `Video backend 'torchcodec' is not available, falling back to 'pyav'` are **expected** and noisy but not fatal.
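If you are unsure whether the patch is already in place, `git apply --check` can test it without touching the working tree. The sketch below wraps that in a hypothetical `apply_patch_once` helper (not part of the playbook or Isaac GR00T); it assumes you run it from the Isaac-GR00T repo root.

```bash
# Hypothetical helper: apply a patch only if it is not already applied.
apply_patch_once() {
  local patch=$1
  if git apply --check "$patch" 2>/dev/null; then
    git apply "$patch" && echo "applied"
  elif git apply --reverse --check "$patch" 2>/dev/null; then
    # The reverse patch applies cleanly, so the change is already present.
    echo "already applied"
  else
    echo "does not apply cleanly" >&2
    return 1
  fi
}

# Usage, with the patch copied into the repo as described above:
# apply_patch_once assets/patches/001-pyav-get-frames-by-indices.patch
```

A "does not apply cleanly" result usually means the checkout is not the branch the patch was written for, so confirm `n1.6-release` is checked out before retrying.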
# Step 3. Set up Hugging Face authentication
```bash
export HF_TOKEN="your_huggingface_token"
```
Get a token from https://huggingface.co/settings/tokens if you don't have one.
# Step 4. Download the dataset and model
Download the LIBERO Spatial dataset and the GR00T N1.6 base model:
```bash
# Download LIBERO Spatial dataset (~2-3 GB)
huggingface-cli download \
--repo-type dataset IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot \
--local-dir examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/
# Copy the LIBERO modality config into the dataset's meta/ directory
cp examples/LIBERO/modality.json \
examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/
# Download GR00T N1.6 base model (~6 GB)
huggingface-cli download nvidia/GR00T-N1.6-3B
```
> [!NOTE]
> **HF cache permission errors:** If `huggingface-cli download` fails with `Permission denied: '/home/.../.cache/huggingface/hub/...'`, the cache directory was previously created by a Docker container running as root (common on shared dev boxes). Point HF at a user-owned cache for this run:
>
> ```bash
> export HF_HOME=$HOME/hf_cache_gr00t
> ```
>
> **Transient `xet-read-token` 500 errors:** Hugging Face's xet backend occasionally returns `500 Internal Server Error` for dataset downloads. Disable it:
>
> ```bash
> export HF_HUB_DISABLE_XET=1
> ```
Verify the dataset is ready:
```bash
ls examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/modality.json
```
**Expected result:** the command prints the path to **`modality.json`** (and `ls` exits 0). That confirms the copied modality file sits next to the downloaded LeRobot dataset metadata.
# Step 5. Verify the base model loads and runs
Confirm the GR00T N1.6 base model loads and produces actions using the **GR1** demo shipped on **`n1.6-release`**:
```bash
TORCHDYNAMO_DISABLE=1 CUDA_VISIBLE_DEVICES=0 python scripts/deployment/standalone_inference_script.py \
--model-path nvidia/GR00T-N1.6-3B \
--dataset-path demo_data/gr1.PickNPlace \
--embodiment-tag GR1 \
--traj-ids 0 \
--inference-mode pytorch \
--action-horizon 8 \
--steps 32
```
**`TORCHDYNAMO_DISABLE=1`** avoids **`torch.compile`** / Triton paths that can fail on GB300 with **`ptxas-blackwell fatal: Value 'sm_103a' is not defined for option 'gpu-name'`**. Keep it on all **`standalone_inference_script.py`** invocations in this playbook unless you have a Triton build that supports SM103.
You should see per-step timing output and no errors. This confirms the model, CUDA, and data pipeline work before a long fine-tuning run.
> [!NOTE]
> The base model's pretrained processor does not include the **`LIBERO_PANDA`** embodiment configuration, so you cannot run this standalone script on the LIBERO dataset with the **base** checkpoint alone. The LIBERO modality config is registered during fine-tuning. That is expected — LIBERO is a post-training benchmark.
# Step 6. Fine-tune GR00T N1.6 on LIBERO Spatial
Fine-tune the base model on LIBERO Spatial. The GB300 GPU in DGX Station, with 284 GB HBM3e, allows a global batch size of **128** on a single GPU, several times what fits on a typical 80 GB GPU. Larger batches stabilize gradients and improve wall-clock throughput **when the dataloader keeps the GPU fed**.
```bash
CUDA_VISIBLE_DEVICES=0 python \
gr00t/experiment/launch_finetune.py \
--base-model-path nvidia/GR00T-N1.6-3B \
--dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
--embodiment-tag LIBERO_PANDA \
--num-gpus 1 \
--output-dir output/libero_spatial_ft \
--save-steps 500 \
--save-total-limit 5 \
--max-steps 2000 \
--global-batch-size 128 \
--learning-rate 1e-4 \
--warmup-ratio 0.05 \
--weight-decay 1e-5 \
--state-dropout-prob 0.8 \
--color-jitter-params brightness 0.3 contrast 0.4 saturation 0.5 hue 0.08 \
--dataloader-num-workers 4
```
If GPU utilization stays **near zero** for many minutes while the process is alive, suspect **video decoding** (see Step 2 patch and Troubleshooting). You can try **`--dataloader-num-workers 8`** if CPU cores are available.
Training runs for **2000 steps** at batch size 128 and takes approximately **20-25 minutes** on GB300 when **`torchcodec`** is the active video backend.
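Wall-clock time scales linearly with the step count and the per-step time, so a rough estimate is easy to compute before committing to a long run. The sketch below uses a hypothetical `estimate_minutes` helper; the per-step times passed to it are illustrative assumptions, not measurements, so substitute the step time you observe in your own logs.

```bash
# Hypothetical helper: estimated wall-clock minutes for STEPS at SEC_PER_STEP.
estimate_minutes() {
  LC_ALL=C awk -v steps="$1" -v sec="$2" 'BEGIN { printf "%.1f\n", steps * sec / 60 }'
}

# Illustrative per-step times (assumptions, not measurements):
estimate_minutes 2000 0.7    # prints 23.3
estimate_minutes 100 6       # prints 10.0
```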
> [!IMPORTANT]
> **With the PyAV fallback (Step 2 patch + no torchcodec)**, expect ~5-6 s per step instead of <1 s, so 2000 steps is closer to **2.5-3 hours**, and GPU utilization sits in the 3-30 % range while CPU-side video decoding starves the GPU. To validate the workflow without the long wait, lower `--max-steps` (e.g. `100`) and `--save-steps` (e.g. `50`); loss should still drop visibly (validated drop **1.07 → 0.63** in 100 steps in this playbook's GB300 run). If you need full-throughput training, build `torchcodec` from source (Troubleshooting → "Video decoding errors") or run **Option A** which builds it for you.
> [!NOTE]
> This playbook uses 2000 steps to keep execution time under an hour. For production-quality results closer to the published **97.65%** success rate on LIBERO Spatial, increase to **20,000 steps** (`--max-steps 20000`). The published settings used a global batch size of **640** across **8** GPUs (80 per GPU), so 128 on one GB300 exceeds the per-GPU batch in that reference.
**What the training flags mean:**
| Flag | Value | Purpose |
|------|-------|---------|
| `--global-batch-size` | 128 | Total samples per training step; enabled by GB300 memory. |
| `--state-dropout-prob` | 0.8 | Drops proprioceptive state 80% of the time so the model relies on vision. |
| `--color-jitter-params` | brightness/contrast/saturation/hue | Photometric augmentation for lighting robustness. |
| `--warmup-ratio` | 0.05 | Linear LR warmup over the first 5% of steps. |
| `--save-steps` | 500 | Checkpoint cadence under `output/libero_spatial_ft/`. |
Monitor the Hugging Face **Trainer** `loss` in the terminal. Checkpoints land under `output/libero_spatial_ft/`.
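The flag values above imply a few derived quantities that are handy when reading the training log. The arithmetic below is a sketch using only numbers from this step and from the reference configuration quoted in the note above (batch size 640 across 8 GPUs); the variable names are illustrative, and the warmup step count assumes the ratio is applied to `--max-steps`.

```bash
# Values taken from the fine-tuning command in this step.
max_steps=2000
global_batch=128
warmup_ratio=0.05
save_steps=500

# Reference run quoted in the note above: batch 640 across 8 GPUs.
ref_batch=640
ref_gpus=8

# Round warmup_ratio * max_steps to the nearest whole step.
warmup_steps=$(LC_ALL=C awk -v s="$max_steps" -v r="$warmup_ratio" 'BEGIN { printf "%d", s * r + 0.5 }')
samples_seen=$((max_steps * global_batch))
checkpoints=$((max_steps / save_steps))
ref_per_gpu=$((ref_batch / ref_gpus))

echo "warmup steps:            $warmup_steps"     # 100
echo "samples seen:            $samples_seen"     # 256000
echo "checkpoints written:     $checkpoints"      # 4
echo "reference per-GPU batch: $ref_per_gpu"      # 80
```

Four checkpoints fit within `--save-total-limit 5`, so none are rotated out during a 2000-step run.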
# Step 7. Evaluate the fine-tuned model
Open-loop evaluation compares predicted actions to dataset ground truth and writes plots to **`/tmp/open_loop_eval/`**:
```bash
CUDA_VISIBLE_DEVICES=0 python gr00t/eval/open_loop_eval.py \
--dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
--embodiment-tag LIBERO_PANDA \
--model-path output/libero_spatial_ft/checkpoint-2000/ \
--traj-ids 0 1 2 \
--action-horizon 16
```
**How to read the run:** the terminal prints **per-trajectory MSE/MAE** and **averages**. The JPEGs under **`/tmp/open_loop_eval/`** overlay **predicted** vs **ground-truth** trajectories per action dimension (translation, rotation, gripper). Use them to confirm the policy tracks pick-and-place phases and gripper open/close timing on spatial tasks.
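For reference, MSE is the mean of the squared differences between predicted and ground-truth values, and MAE is the mean of their absolute differences. The sketch below shows those generic definitions on toy numbers; `mse_mae` is a hypothetical helper, and how `open_loop_eval.py` aggregates across action dimensions and timesteps is not shown here.

```bash
# Hypothetical helper: stdin holds "predicted ground_truth" pairs, one per line.
mse_mae() {
  LC_ALL=C awk '
    NF >= 2 { d = $1 - $2; sq += d * d; ab += (d < 0 ? -d : d); n++ }
    END { if (n) printf "mse=%.4f mae=%.4f\n", sq / n, ab / n }'
}

# Toy data: errors are 1, -2, and 0.
printf '%s\n' "1 0" "2 4" "3 3" | mse_mae    # prints mse=1.6667 mae=1.0000
```

MSE penalizes large deviations more heavily than MAE, so a trajectory with a few big misses shows a larger gap between the two numbers.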
> [!TIP]
> At 2000 steps you should see clear improvement over a random policy; at 20,000 steps, published LIBERO Spatial success reaches **97.65%** in closed-loop sim.
# Step 8. Run inference on a LIBERO sample (timing + actions)
This step passes **LIBERO Spatial** observations through the **fine-tuned** checkpoint (the base model cannot run this embodiment). **`TORCHDYNAMO_DISABLE=1`** is included for GB300:
```bash
TORCHDYNAMO_DISABLE=1 CUDA_VISIBLE_DEVICES=0 python scripts/deployment/standalone_inference_script.py \
--model-path output/libero_spatial_ft/checkpoint-2000/ \
--dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
--embodiment-tag LIBERO_PANDA \
--traj-ids 0 \
--inference-mode pytorch \
--action-horizon 8
```
**What to inspect:** the script prints a **timing breakdown** (data processing, backbone, action head, end-to-end). Compare **MSE/MAE** and latency to Step 5's base-model smoke test. In eager mode (with `TORCHDYNAMO_DISABLE=1`), per-step latency on GB300 depends heavily on the torch + CUDA stack: expect **~3-4 s/step** on torch 2.10 + cu130 (validated in this playbook's run on a fine-tuned `checkpoint-100`); a compiled torch 2.7 + cu128 stack with Triton support for `sm_103` can be much faster. Treat the "Backbone vs Action head" split as the more stable signal across stacks.
# Step 9. Clean up
```bash
deactivate
cd ..
rm -rf Isaac-GR00T
```
Fine-tuned checkpoints under `output/libero_spatial_ft/` are removed with the repo. Copy them elsewhere first if you want to keep them.
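To keep a checkpoint, copy it out before running the `rm -rf` above. The sketch below uses a hypothetical `backup_checkpoint` helper, and the destination `$HOME/gr00t_checkpoints` is only an example path.

```bash
# Hypothetical helper: copy a checkpoint directory into DEST, preserving attributes.
backup_checkpoint() {
  local src=${1%/} dest=$2
  [ -d "$src" ] || { echo "no such checkpoint: $src" >&2; return 1; }
  mkdir -p "$dest" && cp -a "$src" "$dest/"
}

# Usage, from the Isaac-GR00T repo root before cleaning up:
# backup_checkpoint output/libero_spatial_ft/checkpoint-2000 "$HOME/gr00t_checkpoints"
```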
# Next steps
- **Increase training steps** — `--max-steps 20000` for stronger LIBERO Spatial alignment (~3.5 hours at the same throughput).
- **Other LIBERO suites** — `libero_10_no_noops`, `libero_goal_no_noops`, `libero_object_no_noops` from **IPEC-COMMUNITY** on Hugging Face.
- **Closed-loop sim** — LIBERO sim server/client: [LIBERO evaluation in Isaac GR00T](https://github.com/NVIDIA/Isaac-GR00T/blob/n1.6-release/examples/LIBERO/README.md#evaluate-checkpoint).
- **Custom embodiments** — [Fine-tune a new embodiment](https://github.com/NVIDIA/Isaac-GR00T/blob/n1.6-release/getting_started/finetune_new_embodiment.md) (LeRobot v2 + modality JSON).
- **Tune more of the stack** — `--tune-llm` / `--tune-visual` raise memory use; probe batch size if you enable them.
-
id: troubleshooting
label: Troubleshooting
content: |
# Common Issues
## Issue: `git clone` fails or demo videos are tiny / missing (Git LFS)
**Solution:**
```bash
sudo apt-get install -y git-lfs
git lfs install
```
Remove any partial `Isaac-GR00T` directory, then clone again with `--recurse-submodules`.
## Issue: `GR1`, `demo_data/gr1.PickNPlace`, or scripts do not match the playbook
**Cause:** The repository default branch (**`main`**) may track a newer GR00T line (for example N1.7) with different embodiment tags and demo layouts.
**Solution:**
```bash
cd Isaac-GR00T
git fetch origin
git checkout n1.6-release
git submodule update --init --recursive
```
Always run playbook commands from **`n1.6-release`** for **N1.6** + **GR00T-N1.6-3B**.
## Issue: `install_deps.sh` is not allowed on your machine (policy) or you need to know what it changes
**Facts:** `scripts/deployment/dgpu/install_deps.sh` runs **`sudo apt-get`** to install **`ffmpeg`**, **`libaio-dev`**, and (on aarch64) FFmpeg **development** libraries for the **torchcodec** build. If **`/usr/local/cuda`** does not exist, it adds the NVIDIA CUDA apt repo and installs **`cuda-toolkit-12-8`**. It also installs **`uv`** into the user account if missing, then **`uv sync`** + **`uv pip install -e .`** into **`.venv`**.
**Solution (policy-friendly):** Pre-install the same system packages and CUDA using your IT process, ensure **`nvcc`** works, then from the repo root:
```bash
export PATH="$HOME/.local/bin:$PATH"
uv sync
uv pip install -e .
```
On **aarch64**, you still need **`torchcodec`** in `.venv` or rely on the **PyAV patch** (Instructions Step 2) plus **`uv pip install av`**.
## Issue: `uv sync` (Option B) appears stuck for hours building `flash-attn` on aarch64
**Cause:** Upstream `pyproject.toml` lists pre-built `flash-attn==2.7.4.post1` wheels only for `linux_x86_64`. On **aarch64** (Grace + GB300), `uv` falls back to a from-source build that compiles ~72 CUDA kernels — typically **~2 hours** end-to-end.
**Solution:** Pin to `flash-attn==2.8.1` and use the GitHub release's prebuilt aarch64 wheel. Edit `pyproject.toml` in the repo root:
```toml
# under [project] dependencies, replace:
# "flash-attn==2.7.4.post1",
"flash-attn==2.8.1",
# under [tool.uv.sources], add:
flash-attn = [
{ url = "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.1/flash_attn-2.8.1+cu12torch2.10cxx11abiTRUE-cp312-cp312-linux_aarch64.whl",
marker = "sys_platform == 'linux' and platform_machine == 'aarch64' and python_version == '3.12'" },
]
```
The `cu12torch2.10` aarch64 wheel works against torch 2.10 (cu128 or cu130 builds). Validated on GB300 + CUDA 13.1 — `uv sync` completes in ~1 minute instead of ~2 hours. Track upstream Isaac-GR00T for a future commit that bakes this in.
If you must keep `flash-attn==2.7.4.post1` (Option A path), expect the 2-hour build on first sync; subsequent `uv sync` invocations re-use the cached wheel.
## Issue: `install_deps.sh` fails building torchcodec
**Solution:**
Ensure the license confirmation env var is set:
```bash
I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 bash scripts/deployment/dgpu/install_deps.sh
```
If the build still fails, install FFmpeg development libraries:
```bash
sudo apt-get install -y libavdevice-dev libavfilter-dev libavformat-dev \
libavcodec-dev libavutil-dev libswresample-dev libswscale-dev \
pkg-config cmake build-essential pybind11-dev
```
Then apply **Instructions Step 2** (PyAV patch) so training does not depend on a working **torchcodec** for indexed frame reads.
## Issue: `huggingface-cli download` fails with 401 Unauthorized
**Solution:**
```bash
echo $HF_TOKEN
huggingface-cli whoami
```
If the token is not set:
```bash
export HF_TOKEN="your_token_here"
```
Accept any required license or gated-model agreements on the Hugging Face model page.
## Issue: `huggingface-cli download` fails with `Permission denied: '/home/.../.cache/huggingface/hub/...'`
**Cause:** The shared cache directory was previously created by a Docker container running as **root** (common on multi-user dev boxes that mount `~/.cache/huggingface` into containers without `--user`). The current user (`nvidia`) cannot write into it.
**Solution:** point HF at a user-owned cache location for this run:
```bash
export HF_HOME=$HOME/hf_cache_gr00t
mkdir -p "$HF_HOME"
huggingface-cli download nvidia/GR00T-N1.6-3B
```
Re-export `HF_HOME` for the rest of the playbook (Step 5 onward) so model loads find the right cache. To permanently un-stick the original cache, ask whoever owns the container session to chown `~/.cache/huggingface` back to your user.
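To confirm that ownership is the problem, look for paths in the cache that belong to another user. The sketch below is an assumption-laden diagnostic: `hf_cache_root` and `first_foreign_path` are hypothetical helpers, and the fallback location assumes the usual default of `~/.cache/huggingface` when `HF_HOME` is unset.

```bash
# Hypothetical helper: print the cache root the Hugging Face tools would use.
# Assumes the usual default of ~/.cache/huggingface when HF_HOME is unset.
hf_cache_root() {
  echo "${HF_HOME:-$HOME/.cache/huggingface}"
}

# Hypothetical helper: print the first path under DIR not owned by the current user.
first_foreign_path() {
  find "$1" ! -user "$(id -un)" -print 2>/dev/null | head -n 1 || true
}

root=$(hf_cache_root)
foreign=$(first_foreign_path "$root")
if [ -n "$foreign" ]; then
  echo "Not owned by $(id -un): $foreign"
else
  echo "No foreign-owned paths found under $root"
fi
```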
## Issue: `huggingface-cli download` returns `500 Internal Server Error` from the `xet-read-token` endpoint
**Cause:** Hugging Face's xet content-addressable backend occasionally returns transient `5xx`. This blocks dataset downloads even though the underlying files are reachable via the legacy backend.
**Solution:** disable xet for the download:
```bash
export HF_HUB_DISABLE_XET=1
huggingface-cli download --repo-type dataset \
IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot \
--local-dir examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/
```
## Issue: `externally-managed-environment` or `pip` installs not going into `.venv`
**Cause:** Debian/Ubuntu **PEP 668** blocks `pip install` onto the system Python. Mixing **`sudo pip`** with the project venv breaks the playbook.
**Solution:**
1. **`source .venv/bin/activate`** — prompt should show `(.venv)`.
2. Use **`uv pip install ...`** (or **`python -m pip install ...`**) **only** with the venv activated — never `sudo pip` for this project.
3. If the venv was created with a broken `pip`, recreate: `rm -rf .venv` and run **`uv sync`** again from the repo root (after `n1.6-release` checkout).
## Issue: CUDA out of memory during fine-tuning
**Solution:**
Reduce batch size:
```bash
--global-batch-size 64
```
Check for other GPU processes: `nvidia-smi`. **`--tune-llm`** / **`--tune-visual`** increase memory use substantially.
## Issue: Triton / PTXAS errors about `sm_103a` (GB300 / Blackwell)
**Symptom:**
```text
ptxas-blackwell fatal: Value 'sm_103a' is not defined for option 'gpu-name'
```
**Solution:**
For **`scripts/deployment/standalone_inference_script.py`** (which may use **`torch.compile`**), prepend:
```bash
TORCHDYNAMO_DISABLE=1 python scripts/deployment/standalone_inference_script.py ...
```
This forces eager inference (higher latency per step but stable on SM103 until Triton catches up). Fine-tuning and **`open_loop_eval.py`** typically run without this compile path; use the same prefix there **only** if you see the same crash.
## Issue: `ModuleNotFoundError: No module named 'gr00t'`
**Solution:**
```bash
source .venv/bin/activate
pwd # .../Isaac-GR00T
```
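Both conditions (correct directory, active virtual environment) can be checked in one go. The sketch below uses a hypothetical `is_gr00t_root` helper and relies on the `VIRTUAL_ENV` variable that the venv `activate` script sets.

```bash
# Hypothetical helper: succeed if DIR looks like the Isaac-GR00T repo root
# (it contains the gr00t package directory and the project .venv).
is_gr00t_root() {
  [ -d "$1/gr00t" ] && [ -x "$1/.venv/bin/python" ]
}

if ! is_gr00t_root "$(pwd)"; then
  echo "Run from the Isaac-GR00T repo root (needs gr00t/ and .venv/)."
elif [ -z "${VIRTUAL_ENV:-}" ]; then
  echo "Virtual environment not active: run 'source .venv/bin/activate'."
else
  echo "Environment looks right: $VIRTUAL_ENV"
fi
```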
## Issue: `NotImplementedError` in `get_frames_by_indices` when backend is `pyav`
**Cause:** On **`n1.6-release`**, **`resolve_backend`** can select **`pyav`**, but stock **`get_frames_by_indices`** did not implement the **`pyav`** branch.
**Solution:** Apply the playbook patch and install PyAV (see **Instructions Step 2** and `assets/patches/README.md`).
## Issue: Training “hangs” — low GPU utilization, no traceback, very slow steps
**Cause:** Fallback to **per-frame `ffmpeg` subprocess** decoding for **AV1** LIBERO clips; dataloaders starve the GPU.
**Solution:**
1. Apply the **PyAV patch** (Step 2) and **`uv pip install av`**.
2. Optionally increase **`--dataloader-num-workers`** (for example **8**) if CPUs are free.
**Expected noise after patching:** logs may repeat `Video backend 'torchcodec' is not available, falling back to 'pyav'` — that is normal if **torchcodec** is absent.
## Issue: Video decoding errors / `torchcodec` not found (general)
**Solution:**
Prefer the **PyAV patch + `av`** path above for LIBERO on GB300.
If you must build **torchcodec** into `.venv` manually (aarch64), with FFmpeg dev packages installed:
```bash
# Run this from inside the Isaac-GR00T repo root (the directory that
# contains .venv). Capture its absolute path BEFORE changing directories
# so we can still reach the virtualenv after cd'ing into /tmp/torchcodec.
GR00T_ROOT="$(pwd)"
# Sanity check: stop here if the virtualenv interpreter does not exist.
# (`exit 1` also closes an interactive shell, so run this block as a script or in a subshell.)
test -x "$GR00T_ROOT/.venv/bin/python" || { echo "Not in Isaac-GR00T root (missing .venv/bin/python)" >&2; exit 1; }
# Clone the torchcodec source into /tmp/torchcodec (skip if already cloned).
test -d /tmp/torchcodec/.git || git clone https://github.com/pytorch/torchcodec.git /tmp/torchcodec
cd /tmp/torchcodec
# Build torchcodec into the Isaac-GR00T virtualenv using the absolute
# path captured above (do NOT use the relative ".venv/bin/python" here —
# the current directory is /tmp/torchcodec, which has no .venv).
I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 ENABLE_CUDA=1 \
uv pip install --python "$GR00T_ROOT/.venv/bin/python" . --no-build-isolation
```
CUDA-enabled builds can fail when system FFmpeg or CUDA does not match torchcodec expectations — in that case use the **PyAV patch** instead.
## Issue: Training loss is not decreasing
**Solution:**
At 2000 steps training may still be at an early stage. If loss stays flat after many steps:
1. Verify modality file: `ls examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/modality.json`
2. Confirm **`--embodiment-tag LIBERO_PANDA`**
3. Try **`--learning-rate 5e-4`** for faster early movement on short runs
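To judge whether loss is actually flat, compare the first and last logged values rather than eyeballing the scroll. The sketch below assumes the training output was saved to a file (for example by piping through `tee`) and that log entries contain `'loss': <number>`; both are assumptions about your setup, and `loss_trend` is a hypothetical helper. The sample lines reuse the 1.07 and 0.63 values reported in Step 6.

```bash
# Hypothetical helper: print the first and last 'loss' values found on stdin.
loss_trend() {
  grep -o "'loss': [0-9.]*" \
    | LC_ALL=C awk '{ last = $2 } NR == 1 { first = $2 }
        END { if (NR) printf "first=%s last=%s\n", first, last }'
}

# Sample log lines (format is an assumption; values from Step 6):
printf '%s\n' \
  "{'loss': 1.07, 'learning_rate': 2e-05}" \
  "{'loss': 0.63, 'learning_rate': 1e-04}" | loss_trend    # prints first=1.07 last=0.63
```

With a saved log, the call becomes `loss_trend < train.log`, where `train.log` is whatever file you wrote the output to.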
## Issue: `nvidia-smi` shows the wrong GPU
**Solution:**
```bash
nvidia-smi --query-gpu=index,name --format=csv,noheader
CUDA_VISIBLE_DEVICES=<gb300_index> python ...
```
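The index lookup can also be scripted. The sketch below parses the same `index, name` CSV output shown above; `gpu_index_for` is a hypothetical helper, and it simply returns the first row whose name contains the given text.

```bash
# Hypothetical helper: read "index, name" CSV rows on stdin and print the
# index of the first row whose name contains the given text.
gpu_index_for() {
  awk -F', *' -v pat="$1" 'index($2, pat) { print $1; exit }'
}

if command -v nvidia-smi >/dev/null 2>&1; then
  idx=$(nvidia-smi --query-gpu=index,name --format=csv,noheader | gpu_index_for GB300)
  echo "GB300 index: ${idx:-not found}"
fi
```

You can then export the result once (`export CUDA_VISIBLE_DEVICES="$idx"`) instead of prefixing every command.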
## Issue: OpenCV or decord cannot decode LIBERO AV1
**Notes:** **OpenCV** often fails on **AV1** in LIBERO assets. **decord** may lack a compatible wheel for your platform. The **PyAV** patch path is the supported mitigation in this playbook.
resources:
- name: Isaac GR00T (GitHub)
url: https://github.com/NVIDIA/Isaac-GR00T
- name: GR00T N1.6 Model (Hugging Face)
url: https://huggingface.co/nvidia/GR00T-N1.6-3B
- name: GR00T N1.6 Research Blog
url: https://research.nvidia.com/labs/gear/gr00t-n1_6/
- name: GR00T N1 Paper
url: https://arxiv.org/abs/2503.14734
- name: LIBERO Benchmark
url: https://libero-project.github.io/main.html