mirror of
https://github.com/NVIDIA/dgx-spark-playbooks.git
synced 2026-06-18 04:22:21 +00:00
659 lines
36 KiB
YAML
kind: Playbook
|
||
metadata:
|
||
name: station-gr00t
|
||
displayName: Isaac GR00T N1.6 Fine-Tuning
|
||
shortDescription: Fine-tune and benchmark NVIDIA's GR00T N1.6 robotics foundation model on DGX Station
|
||
|
||
publisher: nvidia
|
||
description: |
|
||
# REPLACE THIS WITH YOUR MODEL CARD
|
||
https://gitlab-master.nvidia.com/api-catalog/examples/-/blob/main/modelcard-example-mixtral8x7b.md?ref_type=heads
|
||
|
||
labelsV2:
|
||
- gpuType:playbook:gpu_type_station
|
||
- DGX Station
|
||
- GB300
|
||
- Robotics
|
||
- Isaac GR00T
|
||
- Fine-Tuning
|
||
- Blackwell
|
||
- VLA
|
||
|
||
attributes:
|
||
- key: DURATION
|
||
value: 45 MIN
|
||
|
||
spec:
|
||
artifactName: station-gr00t
|
||
nvcfFunctionId: None
|
||
attributes:
|
||
|
||
showUnavailableBanner: false
|
||
apiDocsUrl: None
|
||
termsOfUse: |
|
||
|
||
cta:
|
||
text: View on GitHub
|
||
url: https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/station-gr00t/
|
||
|
||
|
||
tabs:
|
||
-
|
||
id: overview
|
||
|
||
label: Overview
|
||
content: |
|
||
# Basic idea
|
||
|
||
NVIDIA Isaac GR00T N1.6 is a 3-billion-parameter open vision-language-action (VLA) foundation model for generalist humanoid robot skills. It combines a Cosmos-family vision-language backbone with a 32-layer Diffusion Transformer (DiT) action head that denoises continuous robot actions from multimodal input — language instructions and camera images. The model is pre-trained on a large mixture of robot demonstration data, then adapted to specific embodiments and tasks through fine-tuning.
|
||
|
||
High-level architecture (VLM + DiT action head), as in the upstream Isaac GR00T repo:
|
||
|
||

|
||
|
||
*Source: [NVIDIA Isaac GR00T — `media/GR00T-reference-arch-diagram.png`](https://github.com/NVIDIA/Isaac-GR00T/blob/n1.6-release/media/GR00T-reference-arch-diagram.png). If the local image above is missing, the upstream copy is at `https://raw.githubusercontent.com/NVIDIA/Isaac-GR00T/n1.6-release/media/GR00T-reference-arch-diagram.png`.*
|
||
|
||
In this playbook you will fine-tune GR00T N1.6 on the **LIBERO Spatial** benchmark on a **DGX Station** with **GB300** (large unified memory). That setup supports a high **global batch size (128)** on a single GPU, which improves training throughput compared to typical 24–80 GB consumer or datacenter GPUs.
|
||
|
||
# LIBERO Spatial (what you are fine-tuning on)
|
||
|
||
**LIBERO Spatial** is part of the [LIBERO](https://libero-project.github.io/main.html) suite of simulated tabletop manipulation benchmarks. The **spatial** split emphasizes **where** objects need to be placed: tasks such as putting a bowl on a **stove burner** vs a **plate**, placing utensils in a **mug** vs next to it, or moving objects to **left/right/front** targets on the table. Episodes include third-person RGB video, proprioceptive state, language instructions, and continuous end-effector actions in a consistent LeRobot v2 layout. Understanding these constraints helps when you read training logs or open-loop evaluation plots.
|
||
|
||
# What kind of fine-tuning this playbook uses
|
||
|
||
This playbook runs the **default Isaac GR00T fine-tuning recipe** from `launch_finetune.py`, which does **not** perform full-model weight updates across the entire 3B VLM. In the stock configuration, training focuses on the **action head (DiT)** and the **projector / adapter paths** that map observations into the action model, with strong **state dropout** and **color jitter** so the policy leans on vision. Optional flags such as `--tune-llm` or `--tune-visual` (mentioned under Next steps) trade compute and memory for updating more of the backbone. **LoRA** is not the default here; if your team uses LoRA or other PEFT variants, treat that as a separate configuration branch from this playbook.
|
||
|
||
# NVIDIA DGX Station (why this hardware)
|
||
|
||
**DGX Station** is a deskside AI system built for **large-memory GPU** training and inference (this playbook targets **GB300** with **284 GB HBM3e**). Beyond robotics, the same class of machine supports **large-model fine-tuning**, **RAG serving**, **multi-modal training**, and **CUDA research** where single-GPU memory and bandwidth dominate. For GR00T, the headline benefit is fitting **much larger batch sizes** per GPU than on smaller cards, which stabilizes gradients and improves **samples per second** when the data pipeline keeps up.
|
||
|
||
# What you'll accomplish
|
||
|
||
- Check out the **`n1.6-release`** branch of Isaac GR00T so commands, embodiment tags, and `demo_data/` match GR00T **N1.6**
|
||
- Set up the environment with `uv` (project-local `.venv`) and understand what the optional `install_deps.sh` script changes on the system
|
||
- Apply the recommended **PyAV `get_frames_by_indices` patch** when `torchcodec` is unavailable so LIBERO **AV1** video decoding does not stall on an **ffmpeg** subprocess fallback
|
||
- Verify the base model, fine-tune on LIBERO Spatial at batch size **128**, run open-loop evaluation, and measure inference latency (with **GB300 / Blackwell** TorchDynamo compilation notes)
|
||
|
||
# What to know before starting
|
||
|
||
- Familiarity with Python virtual environments (`source .venv/bin/activate`)
|
||
- Familiarity with PyTorch training concepts (batch size, loss, checkpoints)
|
||
- Basic robot manipulation vocabulary (trajectories, observations, actions)
|
||
- Comfort running commands that may use **`sudo`** for system packages (or use the documented user-space alternative)
|
||
|
||
# Prerequisites
|
||
|
||
- NVIDIA **DGX Station** with **GB300** (Blackwell SM103, 284 GB HBM3e)
|
||
- CUDA toolkit usable by PyTorch: `nvcc --version` should show **CUDA 12.8+** (often already under `/usr/local/cuda` on DGX images)
|
||
- **Git** and **Git LFS** (`git lfs version`) — LFS is required for some demo assets and submodules; install with `sudo apt-get install -y git-lfs` then `git lfs install` if missing
|
||
- Hugging Face account and **HF_TOKEN** for model and dataset downloads
|
||
- Network access to Hugging Face, GitHub, and PyPI
|
||
- At least **~30 GB** free disk for `.venv`, checkpoints, and the LIBERO download
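The checklist above can be turned into a quick preflight check. The following is a minimal sketch and not part of the upstream repo: `check_cmd` and `has_free_gb` are hypothetical helper names, and the 30 GB threshold simply mirrors the disk estimate in the list.

```bash
# Hypothetical helper: report whether a command is available on PATH.
check_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
  fi
}

# Hypothetical helper: succeed when DIR has at least MIN_GB GiB available.
has_free_gb() {
  local dir="$1" min_gb="$2" avail_kb
  # df -Pk prints one POSIX-format row per filesystem; column 4 is available KiB.
  avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}

for cmd in git git-lfs nvcc nvidia-smi; do check_cmd "$cmd"; done
has_free_gb . 30 && echo "ok: at least 30 GB free here" || echo "low disk: under 30 GB free here"
[ -n "${HF_TOKEN:-}" ] && echo "ok: HF_TOKEN is set" || echo "missing: HF_TOKEN"
```

Run it from the directory where you plan to clone Isaac-GR00T so the disk check covers the right filesystem.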
|
||
|
||
# Time & risk
|
||
|
||
* **Duration:** ~45 minutes end-to-end when the video backend is healthy (setup, downloads, ~20–25 min training at 2000 steps, eval and inference)
|
||
* **Risks:** `scripts/deployment/dgpu/install_deps.sh` performs **system-level** `apt` operations and may install the **CUDA 12.8 toolkit** if `/usr/local/cuda` is absent (see Instructions). Model download requires Hugging Face authentication.
|
||
* **Rollback:** Remove the cloned `Isaac-GR00T` directory. To reclaim disk space used by `uv`, optionally run `uv cache clean` (package cache) and `rm -rf ~/.local/share/uv` (`uv`-managed Python installs and tools). Reverting `apt`-installed packages is a separate admin task; the playbook does not uninstall them automatically.
|
||
* **Last Updated:** 05/26/2026
|
||
* First Publication
|
||
|
||
|
||
|
||
-
|
||
id: instructions
|
||
|
||
label: Instructions
|
||
content: |
|
||
# Step 1. Clone Isaac GR00T and install dependencies
|
||
|
||
## 1a. Git LFS (required for a clean clone)
|
||
|
||
If `git clone` fails with errors about **Git LFS** or missing pointer files, install and initialize LFS, then remove any partial `Isaac-GR00T` directory and clone again:
|
||
|
||
```bash
|
||
sudo apt-get update
|
||
sudo apt-get install -y git-lfs
|
||
git lfs install
|
||
```
|
||
|
||
## 1b. Clone and check out `n1.6-release`
|
||
|
||
The **`main`** branch tracks ongoing development (for example newer GR00T milestones) and **does not** always match this **N1.6** playbook. Embodiment tags such as **`GR1`**, paths like **`demo_data/gr1.PickNPlace`**, and tutorial scripts are aligned with the **`n1.6-release`** branch.
|
||
|
||
```bash
|
||
git clone --recurse-submodules https://github.com/NVIDIA/Isaac-GR00T
|
||
cd Isaac-GR00T
|
||
git fetch origin
|
||
git checkout n1.6-release
|
||
git submodule update --init --recursive
|
||
```
|
||
|
||
## 1c. Install Python dependencies
|
||
|
||
### Option A — `install_deps.sh` (matches upstream docs; uses `sudo`)
|
||
|
||
This script is the supported path. It may make **system-level** changes:
|
||
|
||
- Runs `apt-get update` and installs **`ffmpeg`** and **`libaio-dev`**
|
||
- If **`/usr/local/cuda`** is missing, adds the NVIDIA CUDA apt repository and installs **`cuda-toolkit-12-8`**
|
||
- Installs **`uv`** into your user account if needed, then runs **`uv sync`** and **`uv pip install -e .`** into the project **`.venv`**
|
||
- On **aarch64** only: installs FFmpeg **development** packages and **builds `torchcodec` from source** into `.venv`
|
||
|
||
```bash
|
||
I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 bash scripts/deployment/dgpu/install_deps.sh
|
||
```
|
||
|
||
### Option B — User-space only (no `install_deps.sh`)
|
||
|
||
Use this only when **CUDA 12.8+** is already installed, system **`ffmpeg`** / **`libaio-dev`** are already present, and your policy forbids the script's `apt` or CUDA steps. From the **Isaac-GR00T** repo root, install **`uv`** if needed, then:
|
||
|
||
```bash
|
||
command -v uv >/dev/null || curl -LsSf https://astral.sh/uv/install.sh | sh
|
||
export PATH="/usr/local/cuda/bin:$HOME/.local/bin:$PATH"
|
||
export CUDA_HOME=/usr/local/cuda
|
||
uv sync
|
||
uv pip install -e .
|
||
```
|
||
|
||
You still need a working **video backend** for LIBERO (see Step 2). On aarch64, building **torchcodec** inside `.venv` without the script is possible but manual; see Troubleshooting.
|
||
|
||
> [!IMPORTANT]
|
||
> **`PATH` and `CUDA_HOME` matter on multi-toolkit hosts.** If the system has both an old Ubuntu `nvidia-cuda-toolkit` package (`/usr/bin/nvcc` ≈ 12.0) and a current NVIDIA CUDA repo install (`/usr/local/cuda-13.x/bin/nvcc`), `uv` will pick whichever appears first on `PATH`. Putting `/usr/local/cuda/bin` first (and exporting `CUDA_HOME`) is required for `flash-attn`'s source build to find the matching toolkit. Verify with `nvcc --version` after the export.
|
||
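To confirm which toolkit a build will actually see, you can inspect the first `nvcc` on `PATH`. This is a minimal sketch: `nvcc_release` and `version_ge` are hypothetical helpers, and the parsing assumes the usual `release X.Y` wording in `nvcc --version` output.

```bash
# Hypothetical helper: extract "MAJOR.MINOR" from `nvcc --version` text on stdin.
nvcc_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Hypothetical helper: succeed when version $1 is at least version $2 (MAJOR.MINOR).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -t. -k1,1n -k2,2n | head -n1)" = "$2" ]
}

if command -v nvcc >/dev/null 2>&1; then
  rel="$(nvcc --version | nvcc_release)"
  echo "first nvcc on PATH: $(command -v nvcc) (release $rel)"
  version_ge "$rel" 12.8 || echo "warning: release $rel is older than 12.8"
else
  echo "nvcc not found on PATH"
fi
```

If the reported path is `/usr/bin/nvcc` rather than one under `/usr/local/cuda`, re-check the `PATH` export order.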
|
||
> [!WARNING]
|
||
> **`flash-attn` build on aarch64 takes ~2 hours from source.** The upstream `pyproject.toml` only lists pre-built `flash-attn==2.7.4.post1` wheels for **`x86_64`**; on aarch64 (Grace + GB300), `uv sync` falls back to compiling ~72 CUDA kernels from source. A faster route is to pin `flash-attn==2.8.1` and reuse the GitHub release's prebuilt aarch64 wheel:
|
||
>
|
||
> ```toml
|
||
> # In pyproject.toml under [project] dependencies:
|
||
> "flash-attn==2.8.1",
|
||
>
|
||
> # In [tool.uv.sources]:
|
||
> flash-attn = [
|
||
> { url = "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.1/flash_attn-2.8.1+cu12torch2.10cxx11abiTRUE-cp312-cp312-linux_aarch64.whl",
|
||
> marker = "sys_platform == 'linux' and platform_machine == 'aarch64' and python_version == '3.12'" },
|
||
> ]
|
||
> ```
|
||
>
|
||
> With this pin, `uv sync` finishes in ~1 minute on aarch64 instead of ~2 hours. The wheel works against torch 2.10. Verified on GB300 + CUDA 13.1 in this playbook's validation run.
|
||
|
||
Activate the virtual environment:
|
||
|
||
```bash
|
||
source .venv/bin/activate
|
||
```
|
||
|
||
Verify GPU access:
|
||
|
||
```bash
|
||
CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_name(0))"
|
||
```
|
||
|
||
Expected output: `NVIDIA GB300`
|
||
|
||
> [!NOTE]
|
||
> Examples in this playbook use **`CUDA_VISIBLE_DEVICES=0`** because the GB300 is at index `0` on a single-GPU Station. On a multi-GPU Station (for example RTX PRO 6000 + GB300), the GB300 may be at a different index — run `nvidia-smi --query-gpu=index,name --format=csv,noheader`, find the GB300 row, and substitute that index everywhere `CUDA_VISIBLE_DEVICES=0` appears below.
|
||
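The index lookup described in the note can be scripted. This is a minimal sketch: `gpu_index_for` is a hypothetical helper, and the sample rows only imitate the `index, name` shape that `nvidia-smi --format=csv,noheader` prints; they are not captured output.

```bash
# Hypothetical helper: read "index, name" CSV rows on stdin and print the
# index of the first row whose name contains the given text.
gpu_index_for() {
  awk -F', ' -v pat="$1" 'index($2, pat) { print $1; exit }'
}

# Illustrative rows for a two-GPU Station (device names are assumptions).
printf '0, NVIDIA RTX PRO 6000 Blackwell\n1, NVIDIA GB300\n' | gpu_index_for GB300
```

On the Station itself, pipe the real query instead: `nvidia-smi --query-gpu=index,name --format=csv,noheader | gpu_index_for GB300`, then use the printed index for `CUDA_VISIBLE_DEVICES`.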
|
||
# Step 2. PyAV patch for LIBERO video (strongly recommended)
|
||
|
||
On many stacks **`torchcodec`** fails to import or build, the resolver falls back to **`pyav`**, and stock **`n1.6-release`** can raise **`NotImplementedError`** from `get_frames_by_indices` for the **`pyav`** backend (fallback order is already `torchcodec` → `decord` → `pyav` → `ffmpeg`). Without this patch, training may **appear hung**: GPU idle, no traceback, while **ffmpeg** spawns per-frame decode work on the CPU.
|
||
|
||
From the **Isaac-GR00T repo root** with **`n1.6-release`** checked out and **`.venv` activated**:
|
||
|
||
```bash
|
||
git apply /path/to/dgx-spark-playbooks/nvidia/station-gr00t/assets/patches/001-pyav-get-frames-by-indices.patch
|
||
uv pip install av
|
||
```
|
||
|
||
If you copied `nvidia/station-gr00t/assets/patches/` into the Isaac-GR00T root instead, use `git apply assets/patches/001-pyav-get-frames-by-indices.patch`.
|
||
|
||
Details and re-apply rules: `nvidia/station-gr00t/assets/patches/README.md`.
|
||
|
||
After patching, repeated log lines such as `Video backend 'torchcodec' is not available, falling back to 'pyav'` are **expected** and noisy but not fatal.
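To see where the fallback order described in this step would land on your machine, a rough probe like the one below can help. It is a sketch under stated assumptions: `pick_backend` and `probe_backends` are hypothetical helpers, and the probe only tests whether the Python modules `torchcodec`, `decord`, and `av` import and whether `ffmpeg` is on `PATH`. The real resolver inside Isaac GR00T may apply different checks.

```bash
# Hypothetical helper: print the first backend, in the playbook's fallback
# order, that appears in the space-separated list passed as $1.
pick_backend() {
  local available=" $1 " name
  for name in torchcodec decord pyav ffmpeg; do
    case "$available" in *" $name "*) echo "$name"; return 0 ;; esac
  done
  return 1
}

# Hypothetical helper: list backends that look usable in the current shell.
probe_backends() {
  local found="" mod
  for mod in torchcodec decord av; do
    if python -c "import $mod" >/dev/null 2>&1; then
      # PyAV is imported as "av" but the backend is called "pyav".
      [ "$mod" = av ] && found="$found pyav" || found="$found $mod"
    fi
  done
  command -v ffmpeg >/dev/null 2>&1 && found="$found ffmpeg"
  echo "$found"
}

pick_backend "$(probe_backends)" || echo "no video backend found"
```

Run it with `.venv` activated so `python` is the project interpreter.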
|
||
|
||
# Step 3. Set up HuggingFace authentication
|
||
|
||
```bash
|
||
export HF_TOKEN="your_huggingface_token"
|
||
```
|
||
|
||
Get a token from https://huggingface.co/settings/tokens if you don't have one.
|
||
|
||
# Step 4. Download the dataset and model
|
||
|
||
Download the LIBERO Spatial dataset and the GR00T N1.6 base model:
|
||
|
||
```bash
|
||
# Download LIBERO Spatial dataset (~2-3 GB)
|
||
huggingface-cli download \
|
||
--repo-type dataset IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot \
|
||
--local-dir examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/
|
||
|
||
# Copy the LIBERO modality config into the dataset's meta/ directory
|
||
cp examples/LIBERO/modality.json \
|
||
examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/
|
||
|
||
# Download GR00T N1.6 base model (~6 GB)
|
||
huggingface-cli download nvidia/GR00T-N1.6-3B
|
||
```
|
||
|
||
> [!NOTE]
|
||
> **HF cache permission errors:** If `huggingface-cli download` fails with `Permission denied: '/home/.../.cache/huggingface/hub/...'`, the cache directory was previously created by a Docker container running as root (common on shared dev boxes). Point HF at a user-owned cache for this run:
|
||
>
|
||
> ```bash
|
||
> export HF_HOME=$HOME/hf_cache_gr00t
|
||
> ```
|
||
>
|
||
> **Transient `xet-read-token` 500 errors:** Hugging Face's xet backend occasionally returns `500 Internal Server Error` for dataset downloads. Disable it:
|
||
>
|
||
> ```bash
|
||
> export HF_HUB_DISABLE_XET=1
|
||
> ```
|
||
|
||
Verify the dataset is ready:
|
||
|
||
```bash
|
||
ls examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/modality.json
|
||
```
|
||
|
||
**Expected result:** the command prints the path to **`modality.json`** (and `ls` exits 0). That confirms the copied modality file sits next to the downloaded LeRobot dataset metadata.
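For a slightly broader check than a single `ls`, the sketch below reports which metadata files are present. `check_dataset_meta` is a hypothetical helper, and the presence of `meta/info.json` is an assumption based on the usual LeRobot v2 layout rather than something this playbook states.

```bash
# Hypothetical helper: report expected metadata files under DIR/meta and
# return non-zero if any is missing or empty.
check_dataset_meta() {
  local dir="${1%/}" missing=0 f
  for f in modality.json info.json; do
    if [ -s "$dir/meta/$f" ]; then
      echo "ok: meta/$f"
    else
      echo "missing: meta/$f"
      missing=1
    fi
  done
  return "$missing"
}

check_dataset_meta examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
  || echo "dataset is not ready yet"
```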
|
||
|
||
# Step 5. Verify the base model loads and runs
|
||
|
||
Confirm the GR00T N1.6 base model loads and produces actions using the **GR1** demo shipped on **`n1.6-release`**:
|
||
|
||
```bash
|
||
TORCHDYNAMO_DISABLE=1 CUDA_VISIBLE_DEVICES=0 python scripts/deployment/standalone_inference_script.py \
|
||
--model-path nvidia/GR00T-N1.6-3B \
|
||
--dataset-path demo_data/gr1.PickNPlace \
|
||
--embodiment-tag GR1 \
|
||
--traj-ids 0 \
|
||
--inference-mode pytorch \
|
||
--action-horizon 8 \
|
||
--steps 32
|
||
```
|
||
|
||
**`TORCHDYNAMO_DISABLE=1`** avoids **`torch.compile`** / Triton paths that can fail on GB300 with **`ptxas-blackwell fatal: Value 'sm_103a' is not defined for option 'gpu-name'`**. Keep it on all **`standalone_inference_script.py`** invocations in this playbook unless you have a Triton build that supports SM103.
|
||
|
||
You should see per-step timing output and no errors. This confirms the model, CUDA, and data pipeline work before a long fine-tuning run.
|
||
|
||
> [!NOTE]
|
||
> The base model's pretrained processor does not include the **`LIBERO_PANDA`** embodiment configuration, so you cannot run this standalone script on the LIBERO dataset with the **base** checkpoint alone. The LIBERO modality config is registered during fine-tuning. That is expected — LIBERO is a post-training benchmark.
|
||
|
||
# Step 6. Fine-tune GR00T N1.6 on LIBERO Spatial
|
||
|
||
Fine-tune the base model on LIBERO Spatial. DGX Station's GB300 GPU with 284 GB HBM3e allows a global batch size of **128** on a single GPU, more than typically fits on an 80 GB GPU. Larger batches stabilize gradients and improve wall-clock throughput **when the dataloader keeps the GPU fed**.
|
||
|
||
```bash
|
||
CUDA_VISIBLE_DEVICES=0 python \
|
||
gr00t/experiment/launch_finetune.py \
|
||
--base-model-path nvidia/GR00T-N1.6-3B \
|
||
--dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
|
||
--embodiment-tag LIBERO_PANDA \
|
||
--num-gpus 1 \
|
||
--output-dir output/libero_spatial_ft \
|
||
--save-steps 500 \
|
||
--save-total-limit 5 \
|
||
--max-steps 2000 \
|
||
--global-batch-size 128 \
|
||
--learning-rate 1e-4 \
|
||
--warmup-ratio 0.05 \
|
||
--weight-decay 1e-5 \
|
||
--state-dropout-prob 0.8 \
|
||
--color-jitter-params brightness 0.3 contrast 0.4 saturation 0.5 hue 0.08 \
|
||
--dataloader-num-workers 4
|
||
```
|
||
|
||
If GPU utilization stays **near zero** for many minutes while the process is alive, suspect **video decoding** (see Step 2 patch and Troubleshooting). You can try **`--dataloader-num-workers 8`** if CPU cores are available.
|
||
|
||
Training runs for **2000 steps** at batch size 128 and takes approximately **20–25 minutes** on GB300 when **`torchcodec`** is the active video backend.
|
||
|
||
> [!IMPORTANT]
|
||
> **With the PyAV fallback (Step 2 patch + no torchcodec)**, expect ~5–6 s per step instead of <1 s, so 2000 steps takes roughly **3 hours** (2000 × 5–6 s ≈ 2.8–3.3 h), and GPU utilization sits in the 3–30% range while CPU-side video decoding starves the GPU. To validate the workflow without the long wait, lower `--max-steps` (e.g. `100`) and `--save-steps` (e.g. `50`); loss should still drop visibly (validated drop **1.07 → 0.63** in 100 steps in this playbook's GB300 run). If you need full-throughput training, build `torchcodec` from source (Troubleshooting → "Video decoding errors") or run **Option A**, which builds it for you.
|
||
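The wall-clock estimates in this step are plain multiplication, which is easy to redo for your own step count. A minimal sketch follows: `train_minutes` is a hypothetical helper, 0.7 s/step is an assumed pace consistent with 20–25 minutes for 2000 steps, and 5.5 s/step is the midpoint of the PyAV fallback range.

```bash
# Hypothetical helper: wall-clock minutes for STEPS steps at SEC_PER_STEP.
train_minutes() {
  awk -v steps="$1" -v sps="$2" 'BEGIN { printf "%.0f\n", steps * sps / 60 }'
}

train_minutes 2000 0.7   # torchcodec-like pace       -> 23
train_minutes 2000 5.5   # PyAV fallback pace         -> 183
train_minutes 100 5.5    # shortened validation run   -> 9
```

Measure your own seconds per step from the first few logged steps before committing to a long run.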
|
||
> [!NOTE]
|
||
> This playbook uses 2000 steps to keep execution time under an hour. For production-quality results closer to the published **97.65%** success rate on LIBERO Spatial, increase to **20,000 steps** (`--max-steps 20000`). The published settings used a global batch size of **640** across **8** GPUs (80 per GPU), so 128 on one GB300 exceeds the per-GPU batch in that reference.
|
||
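The batch-size comparison in the note reduces to two small calculations. The helpers below are hypothetical names used only for illustration; the inputs are the step counts and batch sizes quoted in this step.

```bash
# Hypothetical helper: per-GPU batch from a global batch size and GPU count.
per_gpu_batch() { echo $(( $1 / $2 )); }

# Hypothetical helper: total samples consumed over a run (steps x global batch).
samples_seen() { echo $(( $1 * $2 )); }

per_gpu_batch 640 8       # published reference -> 80
per_gpu_batch 128 1       # this playbook       -> 128
samples_seen 2000 128     # this playbook       -> 256000
samples_seen 20000 640    # published reference -> 12800000
```

The gap in total samples is one reason a 2000-step run should not be expected to match the published success rate.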
|
||
**What the training flags mean:**
|
||
|
||
| Flag | Value | Purpose |
|
||
|------|-------|---------|
|
||
| `--global-batch-size` | 128 | Total samples per training step; enabled by GB300 memory. |
|
||
| `--state-dropout-prob` | 0.8 | Drops proprioceptive state 80% of the time so the model relies on vision. |
|
||
| `--color-jitter-params` | brightness/contrast/saturation/hue | Photometric augmentation for lighting robustness. |
|
||
| `--warmup-ratio` | 0.05 | Linear LR warmup over the first 5% of steps. |
|
||
| `--save-steps` | 500 | Checkpoint cadence under `output/libero_spatial_ft/`. |
|
||
|
||
Monitor the Hugging Face **Trainer** `loss` in the terminal. Checkpoints land under `output/libero_spatial_ft/`.
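When several checkpoints accumulate, a plain `ls` sorts `checkpoint-1000` before `checkpoint-500`. The sketch below picks the highest step number instead. `latest_checkpoint` is a hypothetical helper and assumes the `checkpoint-<step>` directory naming used elsewhere in this playbook.

```bash
# Hypothetical helper: print the checkpoint-<step> directory with the
# highest step number under DIR (prints nothing if none exist).
latest_checkpoint() {
  local dir="${1%/}" d
  for d in "$dir"/checkpoint-*; do
    # Emit "<step> <path>" so the list can be sorted numerically by step.
    [ -d "$d" ] && echo "${d##*checkpoint-} $d"
  done | sort -n | tail -n1 | cut -d' ' -f2-
}

latest_checkpoint output/libero_spatial_ft
```

The printed path can be passed as `--model-path` in the evaluation and inference steps.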
|
||
|
||
# Step 7. Evaluate the fine-tuned model
|
||
|
||
Open-loop evaluation compares predicted actions to dataset ground truth and writes plots to **`/tmp/open_loop_eval/`**:
|
||
|
||
```bash
|
||
CUDA_VISIBLE_DEVICES=0 python gr00t/eval/open_loop_eval.py \
|
||
--dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
|
||
--embodiment-tag LIBERO_PANDA \
|
||
--model-path output/libero_spatial_ft/checkpoint-2000/ \
|
||
--traj-ids 0 1 2 \
|
||
--action-horizon 16
|
||
```
|
||
|
||
**How to read the run:** the terminal prints **per-trajectory MSE/MAE** and **averages**. The JPEGs under **`/tmp/open_loop_eval/`** overlay **predicted** vs **ground-truth** trajectories per action dimension (translation, rotation, gripper). Use them to confirm the policy tracks pick-and-place phases and gripper open/close timing on spatial tasks.
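As a reminder of what the two error metrics measure, the sketch below computes textbook MSE and MAE over pairs of predicted and ground-truth values. `mse_mae` is a hypothetical helper and the three sample pairs are made up; how `open_loop_eval.py` aggregates across action dimensions and time steps is not specified here and may differ.

```bash
# Hypothetical helper: read "predicted ground_truth" pairs on stdin and
# print mean squared error and mean absolute error.
mse_mae() {
  awk '{ d = $1 - $2; se += d * d; ae += (d < 0 ? -d : d); n++ }
       END { if (n) printf "MSE=%.4f MAE=%.4f\n", se / n, ae / n }'
}

# Made-up values for illustration only.
printf '0.10 0.00\n0.50 0.70\n-0.20 -0.20\n' | mse_mae
# prints: MSE=0.0167 MAE=0.1000
```

MSE penalizes large deviations more heavily than MAE, so a plot with a few sharp misses can show a high MSE alongside a modest MAE.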
|
||
|
||
> [!TIP]
|
||
> At 2000 steps you should see clear improvement over a random policy; at 20,000 steps, published LIBERO Spatial success reaches **97.65%** in closed-loop sim.
|
||
|
||
# Step 8. Run inference on a LIBERO sample (timing + actions)
|
||
|
||
This step passes **LIBERO Spatial** observations through the **fine-tuned** checkpoint (the base model cannot run this embodiment). **`TORCHDYNAMO_DISABLE=1`** is included for GB300:
|
||
|
||
```bash
|
||
TORCHDYNAMO_DISABLE=1 CUDA_VISIBLE_DEVICES=0 python scripts/deployment/standalone_inference_script.py \
|
||
--model-path output/libero_spatial_ft/checkpoint-2000/ \
|
||
--dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
|
||
--embodiment-tag LIBERO_PANDA \
|
||
--traj-ids 0 \
|
||
--inference-mode pytorch \
|
||
--action-horizon 8
|
||
```
|
||
|
||
**What to inspect:** the script prints a **timing breakdown** (data processing, backbone, action head, end-to-end). Compare **MSE/MAE** and latency to Step 5's base-model smoke test. With `TORCHDYNAMO_DISABLE=1` (eager mode), per-step latency on GB300 depends heavily on the torch + CUDA stack: expect **~3–4 s/step** on torch 2.10 + cu130 (validated in this playbook's run on a fine-tuned `checkpoint-100`); a compiled torch 2.7 + cu128 stack with Triton support for `sm_103` can be much faster. Treat the "Backbone vs Action head" split as the more stable signal across stacks.
|
||
|
||
# Step 9. Clean up
|
||
|
||
```bash
|
||
deactivate
|
||
cd ..
|
||
rm -rf Isaac-GR00T
|
||
```
|
||
|
||
Fine-tuned checkpoints under `output/libero_spatial_ft/` are removed with the repo. Copy them elsewhere first if you want to keep them.
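If you want to keep the checkpoints, copy them out before running the cleanup commands. This is a minimal sketch: `backup_dir` is a hypothetical helper and `$HOME/gr00t_checkpoints` is an assumed destination; pick any path outside the repo.

```bash
# Hypothetical helper: copy directory SRC (with its contents) into DEST,
# creating DEST if needed. Fails if SRC does not exist.
backup_dir() {
  local src="${1%/}" dest="$2"
  [ -d "$src" ] || { echo "nothing to back up: $src"; return 1; }
  mkdir -p "$dest" && cp -a "$src" "$dest/"
}

# Run from the Isaac-GR00T root, before `cd ..` and `rm -rf Isaac-GR00T`.
backup_dir output/libero_spatial_ft "$HOME/gr00t_checkpoints" || true
```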
|
||
|
||
# Next steps
|
||
|
||
- **Increase training steps** — `--max-steps 20000` for stronger LIBERO Spatial alignment (~3.5 hours at the same throughput).
|
||
- **Other LIBERO suites** — `libero_10_no_noops`, `libero_goal_no_noops`, `libero_object_no_noops` from **IPEC-COMMUNITY** on Hugging Face.
|
||
- **Closed-loop sim** — LIBERO sim server/client: [LIBERO evaluation in Isaac GR00T](https://github.com/NVIDIA/Isaac-GR00T/blob/n1.6-release/examples/LIBERO/README.md#evaluate-checkpoint).
|
||
- **Custom embodiments** — [Fine-tune a new embodiment](https://github.com/NVIDIA/Isaac-GR00T/blob/n1.6-release/getting_started/finetune_new_embodiment.md) (LeRobot v2 + modality JSON).
|
||
- **Tune more of the stack** — `--tune-llm` / `--tune-visual` raise memory use; probe batch size if you enable them.
|
||
|
||
|
||
|
||
-
|
||
id: troubleshooting
|
||
|
||
label: Troubleshooting
|
||
content: |
|
||
# Common Issues
|
||
|
||
## Issue: `git clone` fails or demo videos are tiny / missing (Git LFS)
|
||
|
||
**Solution:**
|
||
|
||
```bash
|
||
sudo apt-get install -y git-lfs
|
||
git lfs install
|
||
```
|
||
|
||
Remove any partial `Isaac-GR00T` directory, then clone again with `--recurse-submodules`.
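To tell whether "tiny" files are unresolved Git LFS pointers, look for the pointer header. Pointer files are small text files whose first line is `version https://git-lfs.github.com/spec/v1`. The helpers below are hypothetical names, and scanning `demo_data` is only an example target.

```bash
# Hypothetical helper: succeed if FILE looks like a Git LFS pointer.
is_lfs_pointer() {
  head -c 200 "$1" 2>/dev/null | grep -q '^version https://git-lfs.github.com/spec/v1'
}

# Hypothetical helper: list small files under DIR that are LFS pointers.
list_lfs_pointers() {
  find "$1" -type f -size -2k 2>/dev/null | while IFS= read -r f; do
    is_lfs_pointer "$f" && echo "$f"
  done
  return 0
}

list_lfs_pointers demo_data
```

Any path printed here still needs its real content; after installing LFS, `git lfs pull` from the repo root fetches it.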
|
||
|
||
## Issue: `GR1`, `demo_data/gr1.PickNPlace`, or scripts do not match the playbook
|
||
|
||
**Cause:** The repository default branch (**`main`**) may track a newer GR00T line (for example N1.7) with different embodiment tags and demo layouts.
|
||
|
||
**Solution:**
|
||
|
||
```bash
|
||
cd Isaac-GR00T
|
||
git fetch origin
|
||
git checkout n1.6-release
|
||
git submodule update --init --recursive
|
||
```
|
||
|
||
Always run playbook commands from **`n1.6-release`** for **N1.6** + **GR00T-N1.6-3B**.
|
||
|
||
## Issue: `install_deps.sh` is not allowed on your machine (policy) or you need to know what it changes
|
||
|
||
**Facts:** `scripts/deployment/dgpu/install_deps.sh` runs **`sudo apt-get`** to install **`ffmpeg`**, **`libaio-dev`**, and (on aarch64) FFmpeg **development** libraries for the **torchcodec** build. If **`/usr/local/cuda`** does not exist, it adds the NVIDIA CUDA apt repo and installs **`cuda-toolkit-12-8`**. It also installs **`uv`** into the user account if missing, then **`uv sync`** + **`uv pip install -e .`** into **`.venv`**.
|
||
|
||
**Solution (policy-friendly):** Pre-install the same system packages and CUDA using your IT process, ensure **`nvcc`** works, then from the repo root:
|
||
|
||
```bash
|
||
export PATH="/usr/local/cuda/bin:$HOME/.local/bin:$PATH"
export CUDA_HOME=/usr/local/cuda
|
||
uv sync
|
||
uv pip install -e .
|
||
```
|
||
|
||
On **aarch64**, you still need **`torchcodec`** in `.venv` or rely on the **PyAV patch** (Instructions Step 2) plus **`uv pip install av`**.
|
||
|
||
## Issue: `uv sync` (Option B) appears stuck for hours building `flash-attn` on aarch64
|
||
|
||
**Cause:** Upstream `pyproject.toml` lists pre-built `flash-attn==2.7.4.post1` wheels only for `linux_x86_64`. On **aarch64** (Grace + GB300), `uv` falls back to a from-source build that compiles ~72 CUDA kernels — typically **~2 hours** end-to-end.
|
||
|
||
**Solution:** Pin to `flash-attn==2.8.1` and use the GitHub release's prebuilt aarch64 wheel. Edit `pyproject.toml` in the repo root:
|
||
|
||
```toml
|
||
# under [project] dependencies, replace:
|
||
# "flash-attn==2.7.4.post1",
|
||
"flash-attn==2.8.1",
|
||
|
||
# under [tool.uv.sources], add:
|
||
flash-attn = [
|
||
{ url = "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.1/flash_attn-2.8.1+cu12torch2.10cxx11abiTRUE-cp312-cp312-linux_aarch64.whl",
|
||
marker = "sys_platform == 'linux' and platform_machine == 'aarch64' and python_version == '3.12'" },
|
||
]
|
||
```
|
||
|
||
The `cu12torch2.10` aarch64 wheel works against torch 2.10 (cu128 or cu130 builds). Validated on GB300 + CUDA 13.1 — `uv sync` completes in ~1 minute instead of ~2 hours. Track upstream Isaac-GR00T for a future commit that bakes this in.
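The pinned wheel only applies when the interpreter and architecture match its filename tags (`cp312`, `linux_aarch64`), which is what the `marker` line encodes. The sketch below builds the tag suffix for a given environment so you can compare it with the URL. `wheel_suffix` is a hypothetical helper and covers only the CPython and Linux naming pattern seen in that URL.

```bash
# Hypothetical helper: filename suffix for a CPython wheel on Linux,
# given a "MAJOR.MINOR" Python version and a machine architecture.
wheel_suffix() {
  local py="cp$(printf '%s' "$1" | tr -d .)"
  echo "${py}-${py}-linux_$2.whl"
}

wheel_suffix 3.12 aarch64   # suffix of the pinned wheel

if command -v python >/dev/null 2>&1; then
  local_py="$(python -c 'import sys; print("%d.%d" % sys.version_info[:2])')"
  echo "this environment would need: $(wheel_suffix "$local_py" "$(uname -m)")"
fi
```

If the two suffixes differ, the marker will not match and `uv sync` is likely to fall back to the source build.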
|
||
|
||
If you must keep `flash-attn==2.7.4.post1` (Option A path), expect the 2-hour build on first sync; subsequent `uv sync` invocations re-use the cached wheel.
|
||
|
||
## Issue: `install_deps.sh` fails building torchcodec
|
||
|
||
**Solution:**
|
||
|
||
Ensure the license confirmation env var is set:
|
||
|
||
```bash
|
||
I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 bash scripts/deployment/dgpu/install_deps.sh
|
||
```
|
||
|
||
If the build still fails, install FFmpeg development libraries:
|
||
|
||
```bash
|
||
sudo apt-get install -y libavdevice-dev libavfilter-dev libavformat-dev \
|
||
libavcodec-dev libavutil-dev libswresample-dev libswscale-dev \
|
||
pkg-config cmake build-essential pybind11-dev
|
||
```
|
||
|
||
Then apply **Instructions Step 2** (PyAV patch) so training does not depend on a working **torchcodec** for indexed frame reads.
|
||
|
||
## Issue: `huggingface-cli download` fails with 401 Unauthorized
|
||
|
||
**Solution:**
|
||
|
||
```bash
|
||
[ -n "$HF_TOKEN" ] && echo "HF_TOKEN is set" || echo "HF_TOKEN is not set"
|
||
huggingface-cli whoami
|
||
```
|
||
|
||
If the token is not set:
|
||
|
||
```bash
|
||
export HF_TOKEN="your_token_here"
|
||
```
|
||
|
||
Accept any required license or gated-model agreements on the Hugging Face model page.
|
||
|
||
## Issue: `huggingface-cli download` fails with `Permission denied: '/home/.../.cache/huggingface/hub/...'`
|
||
|
||
**Cause:** The shared cache directory was previously created by a Docker container running as **root** (common on multi-user dev boxes that mount `~/.cache/huggingface` into containers without `--user`). The current user (`nvidia`) cannot write into it.
|
||
|
||
**Solution:** point HF at a user-owned cache location for this run:
|
||
|
||
```bash
|
||
export HF_HOME=$HOME/hf_cache_gr00t
|
||
mkdir -p "$HF_HOME"
|
||
huggingface-cli download nvidia/GR00T-N1.6-3B
|
||
```
|
||
|
||
Re-export `HF_HOME` for the rest of the playbook (Step 5 onward) so model loads find the right cache. To permanently un-stick the original cache, ask whoever owns the container session to chown `~/.cache/huggingface` back to your user.
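Before switching caches, you can confirm that ownership is the problem. This is a minimal sketch: `foreign_owned` is a hypothetical helper, and the default cache location `~/.cache/huggingface` is assumed when `HF_HOME` is unset.

```bash
# Hypothetical helper: list up to five entries under DIR that are not
# owned by the current user.
foreign_owned() {
  find "$1" ! -user "$(id -u)" 2>/dev/null | head -n 5
}

cache_dir="${HF_HOME:-$HOME/.cache/huggingface}"
if [ -n "$(foreign_owned "$cache_dir")" ]; then
  echo "entries in $cache_dir owned by another user:"
  foreign_owned "$cache_dir"
else
  echo "no foreign-owned entries found in $cache_dir"
fi
```

Root-owned entries in the listing match the cause described above.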
|
||
|
||
## Issue: `huggingface-cli download` returns `500 Internal Server Error` from the `xet-read-token` endpoint
|
||
|
||
**Cause:** Hugging Face's xet content-addressable backend occasionally returns transient `5xx`. This blocks dataset downloads even though the underlying files are reachable via the legacy backend.
|
||
|
||
**Solution:** disable xet for the download:
|
||
|
||
```bash
|
||
export HF_HUB_DISABLE_XET=1
|
||
huggingface-cli download --repo-type dataset \
|
||
IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot \
|
||
--local-dir examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/
|
||
```
|
||
|
||
## Issue: `externally-managed-environment` or `pip` installs not going into `.venv`
|
||
|
||
**Cause:** Debian/Ubuntu **PEP 668** blocks `pip install` onto the system Python. Mixing **`sudo pip`** with the project venv breaks the playbook.
|
||
|
||
**Solution:**
|
||
|
||
1. **`source .venv/bin/activate`** — prompt should show `(.venv)`.
|
||
2. Use **`uv pip install ...`** (or **`python -m pip install ...`**) **only** with the venv activated — never `sudo pip` for this project.
|
||
3. If the venv was created with a broken `pip`, recreate: `rm -rf .venv` and run **`uv sync`** again from the repo root (after `n1.6-release` checkout).
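A quick way to confirm the shell is really using the project interpreter is to compare `python` on `PATH` with `VIRTUAL_ENV`. The sketch below assumes the standard `activate` script, which sets `VIRTUAL_ENV`; `in_project_venv` is a hypothetical helper.

```bash
# Hypothetical helper: succeed when a venv is active and `python` on PATH
# is that venv's interpreter.
in_project_venv() {
  [ -n "${VIRTUAL_ENV:-}" ] && [ "$(command -v python)" = "$VIRTUAL_ENV/bin/python" ]
}

if in_project_venv; then
  echo "venv active: $VIRTUAL_ENV"
else
  echo "venv not active; run: source .venv/bin/activate"
fi
```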
|
||
|
||
## Issue: CUDA out of memory during fine-tuning
|
||
|
||
**Solution:**
|
||
|
||
Reduce batch size:
|
||
|
||
```bash
|
||
--global-batch-size 64
|
||
```
|
||
|
||
Check for other GPU processes: `nvidia-smi`. **`--tune-llm`** / **`--tune-visual`** increase memory use substantially.
|
||
|
||
## Issue: Triton / PTXAS errors about `sm_103a` (GB300 / Blackwell)
|
||
|
||
**Symptom:**
|
||
|
||
```text
|
||
ptxas-blackwell fatal: Value 'sm_103a' is not defined for option 'gpu-name'
|
||
```
|
||
|
||
**Solution:**
|
||
|
||
For **`scripts/deployment/standalone_inference_script.py`** (which may use **`torch.compile`**), prepend:
|
||
|
||
```bash
|
||
TORCHDYNAMO_DISABLE=1 python scripts/deployment/standalone_inference_script.py ...
|
||
```
|
||
|
||
This forces eager inference (higher latency per step but stable on SM103 until Triton catches up). Fine-tuning and **`open_loop_eval.py`** typically run without this compile path; use the same prefix there **only** if you see the same crash.
|
||
|
||
## Issue: `ModuleNotFoundError: No module named 'gr00t'`
|
||
|
||
**Solution:**
|
||
|
||
```bash
|
||
source .venv/bin/activate
|
||
pwd # .../Isaac-GR00T
|
||
```
|
||
|
||
## Issue: `NotImplementedError` in `get_frames_by_indices` when backend is `pyav`
|
||
|
||
**Cause:** On **`n1.6-release`**, **`resolve_backend`** can select **`pyav`**, but stock **`get_frames_by_indices`** did not implement the **`pyav`** branch.
|
||
|
||
**Solution:** Apply the playbook patch and install PyAV (see **Instructions Step 2** and `assets/patches/README.md`).
|
||
|
||
## Issue: Training “hangs” — low GPU utilization, no traceback, very slow steps
|
||
|
||
**Cause:** Fallback to **per-frame `ffmpeg` subprocess** decoding for **AV1** LIBERO clips; dataloaders starve the GPU.
|
||
|
||
**Solution:**
|
||
|
||
1. Apply the **PyAV patch** (Step 2) and **`uv pip install av`**.
|
||
2. Optionally increase **`--dataloader-num-workers`** (for example **8**) if CPUs are free.
|
||
|
||
**Expected noise after patching:** logs may repeat `Video backend 'torchcodec' is not available, falling back to 'pyav'` — that is normal if **torchcodec** is absent.
|
||
|
||
## Issue: Video decoding errors / `torchcodec` not found (general)
|
||
|
||
**Solution:**
|
||
|
||
Prefer the **PyAV patch + `av`** path above for LIBERO on GB300.
|
||
|
||
If you must build **torchcodec** into `.venv` manually (aarch64), with FFmpeg dev packages installed:
|
||
|
||
```bash
|
||
# Run this from inside the Isaac-GR00T repo root (the directory that
|
||
# contains .venv). Capture its absolute path BEFORE changing directories
|
||
# so we can still reach the virtualenv after cd'ing into /tmp/torchcodec.
|
||
GR00T_ROOT="$(pwd)"
|
||
|
||
# Sanity check — the virtualenv interpreter must already exist.
|
||
test -x "$GR00T_ROOT/.venv/bin/python" || echo "Not in Isaac-GR00T root (missing .venv/bin/python); stop here and cd to the repo root first" >&2
|
||
|
||
# Clone the torchcodec source into /tmp/torchcodec (skip if already cloned).
|
||
test -d /tmp/torchcodec/.git || git clone https://github.com/pytorch/torchcodec.git /tmp/torchcodec
|
||
cd /tmp/torchcodec
|
||
|
||
# Build torchcodec into the Isaac-GR00T virtualenv using the absolute
|
||
# path captured above (do NOT use the relative ".venv/bin/python" here —
|
||
# the current directory is /tmp/torchcodec, which has no .venv).
|
||
I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 ENABLE_CUDA=1 \
|
||
uv pip install --python "$GR00T_ROOT/.venv/bin/python" . --no-build-isolation
|
||
```
|
||
|
||
CUDA-enabled builds can fail when system FFmpeg or CUDA does not match torchcodec expectations — in that case use the **PyAV patch** instead.
|
||
|
||
## Issue: Training loss is not decreasing
|
||
|
||
**Solution:**
|
||
|
||
At 2000 steps, training may still be at an early stage. If loss is flat after many steps:
|
||
|
||
1. Verify modality file: `ls examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/modality.json`
|
||
2. Confirm **`--embodiment-tag LIBERO_PANDA`**
|
||
3. Try **`--learning-rate 5e-4`** for faster early movement on short runs
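To judge whether loss is really flat, compare the first and last logged values instead of eyeballing the terminal. The sketch below assumes the training log contains dictionary-style lines with a `'loss': <value>` entry, as the Hugging Face Trainer commonly prints; `loss_trend` is a hypothetical helper and the sample lines are illustrative, using the 1.07 and 0.63 endpoints quoted in Step 6.

```bash
# Hypothetical helper: read a training log on stdin and summarize the
# first and last 'loss' values it contains.
loss_trend() {
  grep -o "'loss': [0-9.eE+-]*" | awk '{ v = $2 + 0; if (!n++) first = v; last = v }
    END { if (n) printf "first=%.2f last=%.2f steps_logged=%d\n", first, last, n }'
}

# Illustrative log lines (not captured output).
printf "{'loss': 1.07, 'epoch': 0.01}\n{'loss': 0.81, 'epoch': 0.02}\n{'loss': 0.63, 'epoch': 0.03}\n" | loss_trend
# prints: first=1.07 last=0.63 steps_logged=3
```

To use it on a real run, save the terminal output to a file (for example with `tee`) and pipe that file into `loss_trend`.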
|
||
|
||
## Issue: `nvidia-smi` shows the wrong GPU
|
||
|
||
**Solution:**
|
||
|
||
```bash
|
||
nvidia-smi --query-gpu=index,name --format=csv,noheader
|
||
CUDA_VISIBLE_DEVICES=<gb300_index> python ...
|
||
```
|
||
|
||
## Issue: OpenCV or decord cannot decode LIBERO AV1
|
||
|
||
**Notes:** **OpenCV** often fails on **AV1** in LIBERO assets. **decord** may lack a compatible wheel for your platform. The **PyAV** patch path is the supported mitigation in this playbook.
|
||
|
||
|
||
|
||
|
||
resources:
|
||
- name: Isaac GR00T (GitHub)
|
||
url: https://github.com/NVIDIA/Isaac-GR00T
|
||
|
||
|
||
- name: GR00T N1.6 Model (HuggingFace)
|
||
url: https://huggingface.co/nvidia/GR00T-N1.6-3B
|
||
|
||
|
||
- name: GR00T N1.6 Research Blog
|
||
url: https://research.nvidia.com/labs/gear/gr00t-n1_6/
|
||
|
||
|
||
- name: GR00T N1 Paper
|
||
url: https://arxiv.org/abs/2503.14734
|
||
|
||
|
||
- name: LIBERO Benchmark
|
||
url: https://libero-project.github.io/main.html
|
||
|
||
|