
kind: Playbook
metadata:
  name: station-gr00t
  displayName: Isaac GR00T N1.6 Fine-Tuning
  shortDescription: Fine-tune and benchmark NVIDIA's GR00T N1.6 robotics foundation model on DGX Station

  publisher: nvidia
  description: |
    # REPLACE THIS WITH YOUR MODEL CARD
    https://gitlab-master.nvidia.com/api-catalog/examples/-/blob/main/modelcard-example-mixtral8x7b.md?ref_type=heads

  labelsV2:
  - gpuType:playbook:gpu_type_station
  - DGX Station
  - GB300
  - Robotics
  - Isaac GR00T
  - Fine-Tuning
  - Blackwell
  - VLA

  attributes:
  - key: DURATION
    value: 45 MIN

spec:
  artifactName: station-gr00t
  nvcfFunctionId: None
  attributes:

    showUnavailableBanner: false
    apiDocsUrl: None
    termsOfUse: |

    cta:
      text: View on GitHub
      url: https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/station-gr00t/


    tabs:
    -
      id: overview

      label: Overview
      content: |
        # Basic idea

        NVIDIA Isaac GR00T N1.6 is a 3-billion-parameter open vision-language-action (VLA) foundation model for generalist humanoid robot skills. It combines a Cosmos-family vision-language backbone with a 32-layer Diffusion Transformer (DiT) action head that denoises continuous robot actions from multimodal input — language instructions and camera images. The model is pre-trained on a large mixture of robot demonstration data, then adapted to specific embodiments and tasks through fine-tuning.

        High-level architecture (VLM + DiT action head), as in the upstream Isaac GR00T repo:

        ![GR00T N1.6 reference architecture](./assets/GR00T-reference-arch-diagram.png)

        *Source: [NVIDIA Isaac GR00T — `media/GR00T-reference-arch-diagram.png`](https://github.com/NVIDIA/Isaac-GR00T/blob/n1.6-release/media/GR00T-reference-arch-diagram.png). If the local image above is missing, the upstream copy is at `https://raw.githubusercontent.com/NVIDIA/Isaac-GR00T/n1.6-release/media/GR00T-reference-arch-diagram.png`.*

        In this playbook you will fine-tune GR00T N1.6 on the **LIBERO Spatial** benchmark on a **DGX Station** with **GB300** (large unified memory). That setup supports a high **global batch size (128)** on a single GPU, which improves training throughput compared to typical 24–80 GB consumer or datacenter GPUs.

        # LIBERO Spatial (what you are fine-tuning on)

        **LIBERO Spatial** is part of the [LIBERO](https://libero-project.github.io/main.html) suite of simulated tabletop manipulation benchmarks. The **spatial** split emphasizes **where** objects need to be placed: tasks such as putting a bowl on a **stove burner** vs a **plate**, placing utensils in a **mug** vs next to it, or moving objects to **left/right/front** targets on the table. Episodes include third-person RGB video, proprioceptive state, language instructions, and continuous end-effector actions in a consistent LeRobot v2 layout. Understanding these constraints helps when you read training logs or open-loop evaluation plots.

        # What kind of fine-tuning this playbook uses

        This playbook runs the **default Isaac GR00T fine-tuning recipe** from `launch_finetune.py`, which does **not** update the weights of the entire 3B model. In the stock configuration, training focuses on the **action head (DiT)** and the **projector / adapter paths** that map observations into the action model, with strong **state dropout** and **color jitter** so the policy leans on vision. Optional flags such as `--tune-llm` or `--tune-visual` (mentioned under Next steps) trade compute and memory for updating more of the backbone. **LoRA** is not the default here; if your team uses LoRA or other PEFT variants, treat that as a separate configuration branch from this playbook.

        # NVIDIA DGX Station (why this hardware)

        **DGX Station** is a deskside AI system built for **large-memory GPU** training and inference (this playbook targets **GB300** with **284 GB HBM3e**). Beyond robotics, the same class of machine supports **large-model fine-tuning**, **RAG serving**, **multi-modal training**, and **CUDA research** where single-GPU memory and bandwidth dominate. For GR00T, the headline benefit is fitting **much larger batch sizes** per GPU than on smaller cards, which stabilizes gradients and improves **samples per second** when the data pipeline keeps up.

        # What you'll accomplish

        - Check out the **`n1.6-release`** branch of Isaac GR00T so commands, embodiment tags, and `demo_data/` match GR00T **N1.6**
        - Set up the environment with `uv` (project-local `.venv`) and understand what the optional `install_deps.sh` script changes on the system
        - Apply the recommended **PyAV `get_frames_by_indices` patch** when `torchcodec` is unavailable so LIBERO **AV1** video decoding does not stall on an **ffmpeg** subprocess fallback
        - Verify the base model, fine-tune on LIBERO Spatial at batch size **128**, run open-loop evaluation, and measure inference latency (with **GB300 / Blackwell** TorchDynamo compilation notes)

        # What to know before starting

        - Familiarity with Python virtual environments (`source .venv/bin/activate`)
        - Familiarity with PyTorch training concepts (batch size, loss, checkpoints)
        - Basic robot manipulation vocabulary (trajectories, observations, actions)
        - Comfort running commands that may use **`sudo`** for system packages (or use the documented user-space alternative)

        # Prerequisites

        - NVIDIA **DGX Station** with **GB300** (Blackwell SM103, 284 GB HBM3e)
        - CUDA toolkit usable by PyTorch: `nvcc --version` should show **CUDA 12.8+** (often already under `/usr/local/cuda` on DGX images)
        - **Git** and **Git LFS** (`git lfs version`) — LFS is required for some demo assets and submodules; install with `sudo apt-get install -y git-lfs` then `git lfs install` if missing
        - Hugging Face account and **HF_TOKEN** for model and dataset downloads
        - Network access to Hugging Face, GitHub, and PyPI
        - At least **~30 GB** free disk for `.venv`, checkpoints, and the LIBERO download
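
        A quick pre-flight check for the disk-space prerequisite can be sketched in Python. This is an illustrative helper, not part of Isaac GR00T: the name `has_free_space` is hypothetical, and the 30 GB threshold mirrors the estimate above (counted here as 30 × 10^9 bytes).

        ```python
        import shutil


        def has_free_space(path=".", required_gb=30.0):
            """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
            free_bytes = shutil.disk_usage(path).free
            return free_bytes >= required_gb * 1e9


        # Check the current directory, where Isaac-GR00T will be cloned.
        print("Enough free disk:", has_free_space("."))
        ```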

        # Time & risk

        * **Duration:** ~45 minutes end-to-end when the video backend is healthy (setup, downloads, ~20–25 min training at 2000 steps, eval and inference)
        * **Risks:** `scripts/deployment/dgpu/install_deps.sh` performs **system-level** `apt` operations and may install the **CUDA 12.8 toolkit** if `/usr/local/cuda` is absent (see Instructions). Model download requires Hugging Face authentication.
        * **Rollback:** Remove the cloned `Isaac-GR00T` directory and optionally `rm -rf ~/.local/share/uv` if you want to reclaim `uv` caches. Reverting `apt`-installed packages is a separate admin task; the playbook does not uninstall them automatically.
        * **Last Updated:** 05/26/2026
          * First Publication


    -
      id: instructions

      label: Instructions
      content: |
        # Step 1. Clone Isaac GR00T and install dependencies

        ## 1a. Git LFS (required for a clean clone)

        If `git clone` fails with errors about **Git LFS** or missing pointer files, install and initialize LFS, then remove any partial `Isaac-GR00T` directory and clone again:

        ```bash
        sudo apt-get update
        sudo apt-get install -y git-lfs
        git lfs install
        ```

        ## 1b. Clone and check out `n1.6-release`

        The **`main`** branch tracks ongoing development (for example newer GR00T milestones) and **does not** always match this **N1.6** playbook. Embodiment tags such as **`GR1`**, paths like **`demo_data/gr1.PickNPlace`**, and tutorial scripts are aligned with the **`n1.6-release`** branch.

        ```bash
        git clone --recurse-submodules https://github.com/NVIDIA/Isaac-GR00T
        cd Isaac-GR00T
        git fetch origin
        git checkout n1.6-release
        git submodule update --init --recursive
        ```

        ## 1c. Install Python dependencies

        ### Option A — `install_deps.sh` (matches upstream docs; uses `sudo`)

        This script is the supported path. It may make **system-level** changes:

        - Runs `apt-get update` and installs **`ffmpeg`** and **`libaio-dev`**
        - If **`/usr/local/cuda`** is missing, adds the NVIDIA CUDA apt repository and installs **`cuda-toolkit-12-8`**
        - Installs **`uv`** into your user account if needed, then runs **`uv sync`** and **`uv pip install -e .`** into the project **`.venv`**
        - On **aarch64** only: installs FFmpeg **development** packages and **builds `torchcodec` from source** into `.venv`

        ```bash
        I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 bash scripts/deployment/dgpu/install_deps.sh
        ```

        ### Option B — User-space only (no `install_deps.sh`)

        Use this only when **CUDA 12.8+** is already installed, system **`ffmpeg`** / **`libaio-dev`** are already present, and your policy forbids the script's `apt` or CUDA steps. From the **Isaac-GR00T** repo root, install **`uv`** if needed, then:

        ```bash
        command -v uv >/dev/null || curl -LsSf https://astral.sh/uv/install.sh | sh
        export PATH="/usr/local/cuda/bin:$HOME/.local/bin:$PATH"
        export CUDA_HOME=/usr/local/cuda
        uv sync
        uv pip install -e .
        ```

        You still need a working **video backend** for LIBERO (see Step 2). On aarch64, building **torchcodec** inside `.venv` without the script is possible but manual; see Troubleshooting.

        > [!IMPORTANT]
        > **`PATH` and `CUDA_HOME` matter on multi-toolkit hosts.** If the system has both an old Ubuntu `nvidia-cuda-toolkit` package (`/usr/bin/nvcc` ≈ 12.0) and a current NVIDIA CUDA repo install (`/usr/local/cuda-13.x/bin/nvcc`), `uv` will pick whichever appears first on `PATH`. Putting `/usr/local/cuda/bin` first (and exporting `CUDA_HOME`) is required for `flash-attn`'s source build to find the matching toolkit. Verify with `nvcc --version` after the export.
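
        To see which toolkit a source build would pick up, the following Python sketch locates the first `nvcc` on `PATH` and reads its release number. It assumes `nvcc --version` prints a line containing `release <major>.<minor>`; the helper name `parse_nvcc_release` is hypothetical.

        ```python
        import re
        import shutil
        import subprocess


        def parse_nvcc_release(version_text):
            """Extract (major, minor) from `nvcc --version` output, or None if absent."""
            match = re.search(r"release (\d+)\.(\d+)", version_text)
            return (int(match.group(1)), int(match.group(2))) if match else None


        # shutil.which returns the first nvcc on PATH, the one a build would find.
        nvcc_path = shutil.which("nvcc")
        if nvcc_path is None:
            print("nvcc not found on PATH")
        else:
            result = subprocess.run([nvcc_path, "--version"], capture_output=True, text=True)
            print(nvcc_path, parse_nvcc_release(result.stdout))
        ```

        If the printed path is `/usr/bin/nvcc` rather than one under `/usr/local/cuda`, the `PATH` export has not taken effect in the current shell.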

        > [!WARNING]
        > **`flash-attn` build on aarch64 takes ~2 hours from source.** The upstream `pyproject.toml` only lists pre-built `flash-attn==2.7.4.post1` wheels for **`x86_64`**; on aarch64 (Grace + GB300), `uv sync` falls back to compiling ~72 CUDA kernels from source. A faster route is to pin `flash-attn==2.8.1` and reuse the GitHub release's prebuilt aarch64 wheel:
        >
        > ```toml
        > # In pyproject.toml under [project] dependencies:
        > "flash-attn==2.8.1",
        >
        > # In [tool.uv.sources]:
        > flash-attn = [
        >     { url = "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.1/flash_attn-2.8.1+cu12torch2.10cxx11abiTRUE-cp312-cp312-linux_aarch64.whl",
        >       marker = "sys_platform == 'linux' and platform_machine == 'aarch64' and python_version == '3.12'" },
        > ]
        > ```
        >
        > With this pin, `uv sync` finishes in ~1 minute on aarch64 instead of ~2 hours. The wheel works against torch 2.10. Verified on GB300 + CUDA 13.1 in this playbook's validation run.

        Activate the virtual environment:

        ```bash
        source .venv/bin/activate
        ```

        Verify GPU access:

        ```bash
        CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_name(0))"
        ```

        Expected output: `NVIDIA GB300`

        > [!NOTE]
        > Examples in this playbook use **`CUDA_VISIBLE_DEVICES=0`** because the GB300 is at index `0` on a single-GPU Station. On a multi-GPU Station (for example RTX PRO 6000 + GB300), the GB300 may be at a different index — run `nvidia-smi --query-gpu=index,name --format=csv,noheader`, find the GB300 row, and substitute that index everywhere `CUDA_VISIBLE_DEVICES=0` appears below.
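
        Finding the right index can also be scripted. The sketch below parses text in the `index, name` layout produced by the `nvidia-smi` query in the note; the function name `find_gpu_index` and the sample GPU names are hypothetical.

        ```python
        def find_gpu_index(csv_text, name_fragment="GB300"):
            """Return the index of the first GPU whose name contains `name_fragment`, else None."""
            for line in csv_text.strip().splitlines():
                index, _, name = line.partition(",")
                if name_fragment in name:
                    return int(index.strip())
            return None


        # Hypothetical two-GPU listing; paste real nvidia-smi output here instead.
        sample = "0, NVIDIA RTX PRO 6000 Blackwell\n1, NVIDIA GB300"
        print(find_gpu_index(sample))  # prints 1
        ```

        The returned value is what you would substitute for `0` in `CUDA_VISIBLE_DEVICES=0`.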

        # Step 2. PyAV patch for LIBERO video (strongly recommended)

        On many stacks **`torchcodec`** fails to import or build, so the resolver falls back to **`pyav`** (the fallback order is `torchcodec` → `decord` → `pyav` → `ffmpeg`). Stock **`n1.6-release`** can then raise **`NotImplementedError`** from `get_frames_by_indices`, which has no **`pyav`** branch. Without this patch, training may also **appear hung**: the GPU sits idle with no traceback while **ffmpeg** spawns per-frame decode work on the CPU.

        From the **Isaac-GR00T repo root** with **`n1.6-release`** checked out and **`.venv` activated**:

        ```bash
        git apply /path/to/dgx-spark-playbooks/nvidia/station-gr00t/assets/patches/001-pyav-get-frames-by-indices.patch
        uv pip install av
        ```

        If you copied `nvidia/station-gr00t/assets/patches/` into the Isaac-GR00T root instead, use `git apply assets/patches/001-pyav-get-frames-by-indices.patch`.

        Details and re-apply rules: `nvidia/station-gr00t/assets/patches/README.md`.

        After patching, repeated log lines such as `Video backend 'torchcodec' is not available, falling back to 'pyav'` are **expected** and noisy but not fatal.
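
        To predict which backend this step's fallback order would land on, here is a minimal Python sketch. It is not the repository's `resolve_backend`; it only walks the order quoted above, checking whether each module can be found (PyAV imports as `av`) and whether an `ffmpeg` binary is on `PATH`. A module that is found can still fail at import time, so treat the result as a hint.

        ```python
        import importlib.util
        import shutil

        # Fallback order quoted in this step.
        FALLBACK_ORDER = ["torchcodec", "decord", "pyav", "ffmpeg"]


        def detect_available():
            """Return the set of backends that look installed in the current environment."""
            found = set()
            for backend, module in (("torchcodec", "torchcodec"), ("decord", "decord"), ("pyav", "av")):
                if importlib.util.find_spec(module) is not None:
                    found.add(backend)
            if shutil.which("ffmpeg"):  # ffmpeg is a command-line binary, not a module
                found.add("ffmpeg")
            return found


        def pick_backend(available):
            """Return the first backend in FALLBACK_ORDER that is available, else None."""
            return next((name for name in FALLBACK_ORDER if name in available), None)


        print("Would select:", pick_backend(detect_available()))
        ```

        Run it with the project `.venv` activated so the check sees the same packages as training.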

        # Step 3. Set up HuggingFace authentication

        ```bash
        export HF_TOKEN="your_huggingface_token"
        ```

        Get a token from https://huggingface.co/settings/tokens if you don't have one.

        # Step 4. Download the dataset and model

        Download the LIBERO Spatial dataset and the GR00T N1.6 base model:

        ```bash
        # Download LIBERO Spatial dataset (~2-3 GB)
        huggingface-cli download \
            --repo-type dataset IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot \
            --local-dir examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/

        # Copy the LIBERO modality config into the dataset's meta/ directory
        cp examples/LIBERO/modality.json \
            examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/

        # Download GR00T N1.6 base model (~6 GB)
        huggingface-cli download nvidia/GR00T-N1.6-3B
        ```

        > [!NOTE]
        > **HF cache permission errors:** If `huggingface-cli download` fails with `Permission denied: '/home/.../.cache/huggingface/hub/...'`, the cache directory was previously created by a Docker container running as root (common on shared dev boxes). Point HF at a user-owned cache for this run:
        >
        > ```bash
        > export HF_HOME=$HOME/hf_cache_gr00t
        > ```
        >
        > **Transient `xet-read-token` 500 errors:** Hugging Face's xet backend occasionally returns `500 Internal Server Error` for dataset downloads. Disable it:
        >
        > ```bash
        > export HF_HUB_DISABLE_XET=1
        > ```

        Verify the dataset is ready:

        ```bash
        ls examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/modality.json
        ```

        **Expected result:** the command prints the path to **`modality.json`** (and `ls` exits 0). That confirms the copied modality file sits next to the downloaded LeRobot dataset metadata.
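
        Beyond checking that the file exists, you can confirm it parses as JSON. The sketch below assumes `modality.json` holds a JSON object at the top level and simply lists its keys; the helper name `top_level_keys` is hypothetical, and no particular key names are assumed.

        ```python
        import json
        from pathlib import Path


        def top_level_keys(json_text):
            """Parse JSON text and return its top-level keys, sorted."""
            data = json.loads(json_text)
            if not isinstance(data, dict):
                raise ValueError("expected a JSON object at the top level")
            return sorted(data)


        # Run from the Isaac-GR00T repo root.
        path = Path("examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/modality.json")
        if path.exists():
            print(top_level_keys(path.read_text()))
        else:
            print("modality.json not found; repeat the copy command in Step 4")
        ```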

        # Step 5. Verify the base model loads and runs

        Confirm the GR00T N1.6 base model loads and produces actions using the **GR1** demo shipped on **`n1.6-release`**:

        ```bash
        TORCHDYNAMO_DISABLE=1 CUDA_VISIBLE_DEVICES=0 python scripts/deployment/standalone_inference_script.py \
            --model-path nvidia/GR00T-N1.6-3B \
            --dataset-path demo_data/gr1.PickNPlace \
            --embodiment-tag GR1 \
            --traj-ids 0 \
            --inference-mode pytorch \
            --action-horizon 8 \
            --steps 32
        ```

        **`TORCHDYNAMO_DISABLE=1`** avoids **`torch.compile`** / Triton paths that can fail on GB300 with **`ptxas-blackwell fatal: Value 'sm_103a' is not defined for option 'gpu-name'`**. Keep it on all **`standalone_inference_script.py`** invocations in this playbook unless you have a Triton build that supports SM103.

        You should see per-step timing output and no errors. This confirms the model, CUDA, and data pipeline work before a long fine-tuning run.

        > [!NOTE]
        > The base model's pretrained processor does not include the **`LIBERO_PANDA`** embodiment configuration, so you cannot run this standalone script on the LIBERO dataset with the **base** checkpoint alone. The LIBERO modality config is registered during fine-tuning. That is expected — LIBERO is a post-training benchmark.

        # Step 6. Fine-tune GR00T N1.6 on LIBERO Spatial

        Fine-tune the base model on LIBERO Spatial. DGX Station's GB300 GPU with 284 GB HBM3e allows a global batch size of **128**, several times what fits on a typical 80 GB GPU. Larger batches stabilize gradients and improve wall-clock throughput **when the dataloader keeps the GPU fed**.

        ```bash
        CUDA_VISIBLE_DEVICES=0 python \
            gr00t/experiment/launch_finetune.py \
            --base-model-path nvidia/GR00T-N1.6-3B \
            --dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
            --embodiment-tag LIBERO_PANDA \
            --num-gpus 1 \
            --output-dir output/libero_spatial_ft \
            --save-steps 500 \
            --save-total-limit 5 \
            --max-steps 2000 \
            --global-batch-size 128 \
            --learning-rate 1e-4 \
            --warmup-ratio 0.05 \
            --weight-decay 1e-5 \
            --state-dropout-prob 0.8 \
            --color-jitter-params brightness 0.3 contrast 0.4 saturation 0.5 hue 0.08 \
            --dataloader-num-workers 4
        ```

        If GPU utilization stays **near zero** for many minutes while the process is alive, suspect **video decoding** (see Step 2 patch and Troubleshooting). You can try **`--dataloader-num-workers 8`** if CPU cores are available.

        Training runs for **2000 steps** at batch size 128 and takes approximately **20–25 minutes** on GB300 when **`torchcodec`** is the active video backend.

        > [!IMPORTANT]
        > **With the PyAV fallback (Step 2 patch + no torchcodec)**, expect ~5–6 s per step instead of <1 s, so 2000 steps takes roughly **3 hours** (about 2.8–3.3 h), and GPU utilization sits in the 3–30 % range while CPU-side video decoding starves the GPU. To validate the workflow without the long wait, lower `--max-steps` (e.g. `100`) and `--save-steps` (e.g. `50`), then use `checkpoint-100` in place of `checkpoint-2000` in Steps 7 and 8; loss should still drop visibly (validated drop **1.07 → 0.63** in 100 steps in this playbook's GB300 run). If you need full-throughput training, build `torchcodec` from source (Troubleshooting → "Video decoding errors") or run **Option A**, which builds it for you.
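
        The wall-clock estimate for the PyAV fallback is steps times seconds per step. A small Python sketch of that arithmetic, using the 5–6 s/step figures from the note (the function name `training_hours` is hypothetical):

        ```python
        def training_hours(max_steps, seconds_per_step):
            """Estimated wall-clock training time in hours."""
            return max_steps * seconds_per_step / 3600


        for sec in (5.0, 6.0):
            print(f"{sec} s/step -> {training_hours(2000, sec):.1f} h for 2000 steps")

        # A 100-step smoke test at the slower rate, reported in minutes.
        print(f"100 steps at 6 s/step -> {training_hours(100, 6.0) * 60:.0f} min")
        ```

        Substitute the step time you observe in your own logs; the figures above are only the range reported for this playbook's run.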

        > [!NOTE]
        > This playbook uses 2000 steps to keep execution time under an hour. For production-quality results closer to the published **97.65%** success rate on LIBERO Spatial, increase to **20,000 steps** (`--max-steps 20000`). Published settings used batch size **640** across **8** GPUs — 128 on one GB300 exceeds the per-GPU batch in that reference.
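
        The per-GPU comparison in the note follows from dividing the global batch by the GPU count. A short Python sketch of that arithmetic, plus the number of samples each run length consumes at this playbook's batch size (function names are hypothetical):

        ```python
        def per_gpu_batch(global_batch_size, num_gpus):
            """Samples each GPU processes per training step."""
            return global_batch_size // num_gpus


        def samples_seen(max_steps, global_batch_size):
            """Total training samples consumed over a run."""
            return max_steps * global_batch_size


        print(per_gpu_batch(640, 8))      # published reference: 80 per GPU
        print(per_gpu_batch(128, 1))      # this playbook: 128 on one GB300
        print(samples_seen(2000, 128))    # 256000 samples in the short run
        print(samples_seen(20000, 128))   # 2560000 samples at 20,000 steps
        ```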

        **What the training flags mean:**

        | Flag | Value | Purpose |
        |------|-------|---------|
        | `--global-batch-size` | 128 | Total samples per training step; enabled by GB300 memory. |
        | `--state-dropout-prob` | 0.8 | Drops proprioceptive state 80% of the time so the model relies on vision. |
        | `--color-jitter-params` | brightness/contrast/saturation/hue | Photometric augmentation for lighting robustness. |
        | `--warmup-ratio` | 0.05 | Linear LR warmup over the first 5% of steps. |
        | `--save-steps` | 500 | Checkpoint cadence under `output/libero_spatial_ft/`. |

        Monitor the Hugging Face **Trainer** `loss` in the terminal. Checkpoints land under `output/libero_spatial_ft/`.
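
        As a rough guide to what the flags imply for this run, the Python sketch below derives the warmup length and the checkpoint steps from the values in the table. It assumes warmup is simply `max_steps × warmup_ratio` and that a checkpoint is written every `save_steps`; the trainer's exact rounding may differ, and the function names are hypothetical.

        ```python
        def warmup_steps(max_steps, warmup_ratio):
            """Approximate number of linear-warmup steps."""
            return round(max_steps * warmup_ratio)


        def checkpoint_steps(max_steps, save_steps):
            """Steps at which a checkpoint is expected."""
            return list(range(save_steps, max_steps + 1, save_steps))


        print(warmup_steps(2000, 0.05))     # 100
        print(checkpoint_steps(2000, 500))  # [500, 1000, 1500, 2000]
        ```

        With four expected checkpoints and `--save-total-limit 5`, none should be rotated out during the 2000-step run.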

        # Step 7. Evaluate the fine-tuned model

        Open-loop evaluation compares predicted actions to dataset ground truth and writes plots to **`/tmp/open_loop_eval/`**:

        ```bash
        CUDA_VISIBLE_DEVICES=0 python gr00t/eval/open_loop_eval.py \
            --dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
            --embodiment-tag LIBERO_PANDA \
            --model-path output/libero_spatial_ft/checkpoint-2000/ \
            --traj-ids 0 1 2 \
            --action-horizon 16
        ```

        **How to read the run:** the terminal prints **per-trajectory MSE/MAE** and **averages**. The JPEGs under **`/tmp/open_loop_eval/`** overlay **predicted** vs **ground-truth** trajectories per action dimension (translation, rotation, gripper). Use them to confirm the policy tracks pick-and-place phases and gripper open/close timing on spatial tasks.
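
        For reference, MSE and MAE over an action trajectory can be computed as in the sketch below. This is a plain-Python illustration of the two metrics on toy numbers, not the code in `open_loop_eval.py`; the function name `mse_mae` and the sample values are made up for the example.

        ```python
        def mse_mae(predicted, ground_truth):
            """Mean squared and mean absolute error over all timesteps and action dimensions."""
            errors = [
                p - g
                for pred_row, gt_row in zip(predicted, ground_truth)
                for p, g in zip(pred_row, gt_row)
            ]
            mse = sum(e * e for e in errors) / len(errors)
            mae = sum(abs(e) for e in errors) / len(errors)
            return mse, mae


        # Two timesteps, two action dimensions (toy values).
        predicted = [[0.0, 0.0], [1.0, 1.0]]
        ground_truth = [[0.0, 1.0], [1.0, 3.0]]
        print(mse_mae(predicted, ground_truth))  # (1.25, 0.75)
        ```

        MSE penalizes large deviations more heavily than MAE, so a big gap between the two usually points to a few badly mispredicted timesteps rather than uniform drift.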

        > [!TIP]
        > At 2000 steps you should see clear improvement over a random policy; at 20,000 steps, published LIBERO Spatial success reaches **97.65%** in closed-loop sim.

        # Step 8. Run inference on a LIBERO sample (timing + actions)

        This step passes **LIBERO Spatial** observations through the **fine-tuned** checkpoint (the base model cannot run this embodiment). **`TORCHDYNAMO_DISABLE=1`** is included for GB300:

        ```bash
        TORCHDYNAMO_DISABLE=1 CUDA_VISIBLE_DEVICES=0 python scripts/deployment/standalone_inference_script.py \
            --model-path output/libero_spatial_ft/checkpoint-2000/ \
            --dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
            --embodiment-tag LIBERO_PANDA \
            --traj-ids 0 \
            --inference-mode pytorch \
            --action-horizon 8
        ```

        **What to inspect:** the script prints a **timing breakdown** (data processing, backbone, action head, end-to-end). Compare **MSE/MAE** and latency to Step 5's base-model smoke test. With `TORCHDYNAMO_DISABLE=1` the model runs in eager mode, where per-step latency on GB300 depends heavily on the torch + CUDA stack: expect **~3–4 s/step** on torch 2.10 + cu130 (validated in this playbook's run on a fine-tuned `checkpoint-100`). A compiled torch 2.7 + cu128 stack with Triton support for `sm_103` can be much faster. Treat the "Backbone vs Action head" split as the more stable signal across stacks.

        # Step 9. Clean up

        ```bash
        deactivate
        cd ..
        rm -rf Isaac-GR00T
        ```

        Fine-tuned checkpoints under `output/libero_spatial_ft/` are removed with the repo. Copy them elsewhere first if you want to keep them.
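
        Before deleting the repo, you may want to identify the newest checkpoint to keep. The Python sketch below picks the highest-numbered `checkpoint-<step>` directory, comparing steps as integers because a plain string sort would rank `checkpoint-500` above `checkpoint-2000`. The helper name and the `~/gr00t_checkpoints` destination are hypothetical. Run it from the Isaac-GR00T root before the `rm -rf`.

        ```python
        import re
        import shutil
        from pathlib import Path


        def latest_checkpoint(names):
            """Return the `checkpoint-<step>` name with the highest step, or None."""
            steps = {}
            for name in names:
                match = re.fullmatch(r"checkpoint-(\d+)", name)
                if match:
                    steps[int(match.group(1))] = name
            return steps[max(steps)] if steps else None


        output_dir = Path("output/libero_spatial_ft")
        if output_dir.is_dir():
            newest = latest_checkpoint(p.name for p in output_dir.iterdir())
            print("Newest checkpoint:", newest)
            # To keep it, copy it outside the repo first, for example:
            # shutil.copytree(output_dir / newest, Path.home() / "gr00t_checkpoints" / newest)
        ```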

        # Next steps

        - **Increase training steps** — `--max-steps 20000` for stronger LIBERO Spatial alignment (~3.5 hours at the same throughput).
        - **Other LIBERO suites** — `libero_10_no_noops`, `libero_goal_no_noops`, `libero_object_no_noops` from **IPEC-COMMUNITY** on Hugging Face.
        - **Closed-loop sim** — LIBERO sim server/client: [LIBERO evaluation in Isaac GR00T](https://github.com/NVIDIA/Isaac-GR00T/blob/n1.6-release/examples/LIBERO/README.md#evaluate-checkpoint).
        - **Custom embodiments** — [Fine-tune a new embodiment](https://github.com/NVIDIA/Isaac-GR00T/blob/n1.6-release/getting_started/finetune_new_embodiment.md) (LeRobot v2 + modality JSON).
        - **Tune more of the stack** — `--tune-llm` / `--tune-visual` raise memory use; probe batch size if you enable them.


    -
      id: troubleshooting

      label: Troubleshooting
      content: |
        # Common Issues

        ## Issue: `git clone` fails or demo videos are tiny / missing (Git LFS)

        **Solution:**

        ```bash
        sudo apt-get install -y git-lfs
        git lfs install
        ```

        Remove any partial `Isaac-GR00T` directory, then clone again with `--recurse-submodules`.

        ## Issue: `GR1`, `demo_data/gr1.PickNPlace`, or scripts do not match the playbook

        **Cause:** The repository default branch (**`main`**) may track a newer GR00T line (for example N1.7) with different embodiment tags and demo layouts.

        **Solution:**

        ```bash
        cd Isaac-GR00T
        git fetch origin
        git checkout n1.6-release
        git submodule update --init --recursive
        ```

        Always run playbook commands from **`n1.6-release`** for **N1.6** + **GR00T-N1.6-3B**.

        ## Issue: `install_deps.sh` is not allowed on your machine (policy) or you need to know what it changes

        **Facts:** `scripts/deployment/dgpu/install_deps.sh` runs **`sudo apt-get`** to install **`ffmpeg`**, **`libaio-dev`**, and (on aarch64) FFmpeg **development** libraries for the **torchcodec** build. If **`/usr/local/cuda`** does not exist, it adds the NVIDIA CUDA apt repo and installs **`cuda-toolkit-12-8`**. It also installs **`uv`** into the user account if missing, then **`uv sync`** + **`uv pip install -e .`** into **`.venv`**.

        **Solution (policy-friendly):** Pre-install the same system packages and CUDA using your IT process, ensure **`nvcc`** works, then from the repo root:

        ```bash
        export PATH="/usr/local/cuda/bin:$HOME/.local/bin:$PATH"
        export CUDA_HOME=/usr/local/cuda
        uv sync
        uv pip install -e .
        ```

        On **aarch64**, you still need **`torchcodec`** in `.venv` or rely on the **PyAV patch** (Instructions Step 2) plus **`uv pip install av`**.

        ## Issue: `uv sync` (Option B) appears stuck for hours building `flash-attn` on aarch64

        **Cause:** Upstream `pyproject.toml` lists pre-built `flash-attn==2.7.4.post1` wheels only for `linux_x86_64`. On **aarch64** (Grace + GB300), `uv` falls back to a from-source build that compiles ~72 CUDA kernels — typically **~2 hours** end-to-end.

        **Solution:** Pin to `flash-attn==2.8.1` and use the GitHub release's prebuilt aarch64 wheel. Edit `pyproject.toml` in the repo root:

        ```toml
        # under [project] dependencies, replace:
        # "flash-attn==2.7.4.post1",
        "flash-attn==2.8.1",

        # under [tool.uv.sources], add:
        flash-attn = [
            { url = "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.1/flash_attn-2.8.1+cu12torch2.10cxx11abiTRUE-cp312-cp312-linux_aarch64.whl",
              marker = "sys_platform == 'linux' and platform_machine == 'aarch64' and python_version == '3.12'" },
        ]
        ```

        The `cu12torch2.10` aarch64 wheel works against torch 2.10 (cu128 or cu130 builds). Validated on GB300 + CUDA 13.1 — `uv sync` completes in ~1 minute instead of ~2 hours. Track upstream Isaac-GR00T for a future commit that bakes this in.

        If you must keep `flash-attn==2.7.4.post1` (Option A path), expect the 2-hour build on first sync; subsequent `uv sync` invocations re-use the cached wheel.

        ## Issue: `install_deps.sh` fails building torchcodec

        **Solution:**

        Ensure the license confirmation env var is set:

        ```bash
        I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 bash scripts/deployment/dgpu/install_deps.sh
        ```

        If the build still fails, install FFmpeg development libraries:

        ```bash
        sudo apt-get install -y libavdevice-dev libavfilter-dev libavformat-dev \
            libavcodec-dev libavutil-dev libswresample-dev libswscale-dev \
            pkg-config cmake build-essential pybind11-dev
        ```

        Then apply **Instructions Step 2** (PyAV patch) so training does not depend on a working **torchcodec** for indexed frame reads.

        ## Issue: `huggingface-cli download` fails with 401 Unauthorized

        **Solution:**

        ```bash
        echo $HF_TOKEN
        huggingface-cli whoami
        ```

        If the token is not set:

        ```bash
        export HF_TOKEN="your_token_here"
        ```

        Accept any required license or gated-model agreements on the Hugging Face model page.

        ## Issue: `huggingface-cli download` fails with `Permission denied: '/home/.../.cache/huggingface/hub/...'`

        **Cause:** The shared cache directory was previously created by a Docker container running as **root** (common on multi-user dev boxes that mount `~/.cache/huggingface` into containers without `--user`). The current user (`nvidia`) cannot write into it.

        **Solution:** point HF at a user-owned cache location for this run:

        ```bash
        export HF_HOME=$HOME/hf_cache_gr00t
        mkdir -p "$HF_HOME"
        huggingface-cli download nvidia/GR00T-N1.6-3B
        ```

        Re-export `HF_HOME` for the rest of the playbook (Step 5 onward) so model loads find the right cache. To permanently un-stick the original cache, ask whoever owns the container session to chown `~/.cache/huggingface` back to your user.
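
        To check which cache directory is in effect and whether your user can write to it, here is a minimal Python sketch. It assumes the usual Hugging Face layout, where `HF_HOME` defaults to `$XDG_CACHE_HOME/huggingface` (or `~/.cache/huggingface`) and downloads go under its `hub/` subdirectory; the function names are hypothetical, and the write test is a heuristic that only inspects the nearest existing directory.

        ```python
        import os


        def resolve_hf_home(env):
            """Return the Hugging Face home directory implied by an environment mapping."""
            if env.get("HF_HOME"):
                return os.path.expanduser(env["HF_HOME"])
            cache_root = env.get("XDG_CACHE_HOME") or os.path.join("~", ".cache")
            return os.path.expanduser(os.path.join(cache_root, "huggingface"))


        def is_writable(path):
            """True if `path`, or its nearest existing parent, is writable by the current user."""
            path = os.path.abspath(path)
            while not os.path.exists(path):
                path = os.path.dirname(path)
            return os.access(path, os.W_OK)


        hub_dir = os.path.join(resolve_hf_home(os.environ), "hub")
        print(hub_dir, "writable:", is_writable(hub_dir))
        ```

        A `False` result for the default location is the situation this issue describes; after exporting `HF_HOME` to a user-owned path, the same check should print `True`.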

        ## Issue: `huggingface-cli download` returns `500 Internal Server Error` from the `xet-read-token` endpoint

        **Cause:** Hugging Face's xet content-addressable backend occasionally returns transient `5xx`. This blocks dataset downloads even though the underlying files are reachable via the legacy backend.

        **Solution:** disable xet for the download:

        ```bash
        export HF_HUB_DISABLE_XET=1
        huggingface-cli download --repo-type dataset \
            IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot \
            --local-dir examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/
        ```

        ## Issue: `externally-managed-environment` or `pip` installs not going into `.venv`

        **Cause:** Debian/Ubuntu **PEP 668** blocks `pip install` onto the system Python. Mixing **`sudo pip`** with the project venv breaks the playbook.

        **Solution:**

        1. **`source .venv/bin/activate`** — prompt should show `(.venv)`.
        2. Use **`uv pip install ...`** (or **`python -m pip install ...`**) **only** with the venv activated — never `sudo pip` for this project.
        3. If the venv was created with a broken `pip`, recreate: `rm -rf .venv` and run **`uv sync`** again from the repo root (after `n1.6-release` checkout).

        ## Issue: CUDA out of memory during fine-tuning

        **Solution:**

        Reduce batch size:

        ```bash
        --global-batch-size 64
        ```

        Check for other GPU processes: `nvidia-smi`. **`--tune-llm`** / **`--tune-visual`** increase memory use substantially.

        ## Issue: Triton / PTXAS errors about `sm_103a` (GB300 / Blackwell)

        **Symptom:**

        ```text
        ptxas-blackwell fatal: Value 'sm_103a' is not defined for option 'gpu-name'
        ```

        **Solution:**

        For **`scripts/deployment/standalone_inference_script.py`** (which may use **`torch.compile`**), prepend:

        ```bash
        TORCHDYNAMO_DISABLE=1 python scripts/deployment/standalone_inference_script.py ...
        ```

        This forces eager inference (higher latency per step but stable on SM103 until Triton catches up). Fine-tuning and **`open_loop_eval.py`** typically run without this compile path; use the same prefix there **only** if you see the same crash.

        ## Issue: `ModuleNotFoundError: No module named 'gr00t'`

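        **Solution:** Run playbook commands from the Isaac-GR00T repo root with the project `.venv` activated (the editable install from `uv pip install -e .` lives there); `pwd` should end in `Isaac-GR00T`: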

        ```bash
        source .venv/bin/activate
        pwd   # .../Isaac-GR00T
        ```

        ## Issue: `NotImplementedError` in `get_frames_by_indices` when backend is `pyav`

        **Cause:** On **`n1.6-release`**, **`resolve_backend`** can select **`pyav`**, but stock **`get_frames_by_indices`** did not implement the **`pyav`** branch.

        **Solution:** Apply the playbook patch and install PyAV (see **Instructions Step 2** and `assets/patches/README.md`).

        ## Issue: Training “hangs” — low GPU utilization, no traceback, very slow steps

        **Cause:** Fallback to **per-frame `ffmpeg` subprocess** decoding for **AV1** LIBERO clips; dataloaders starve the GPU.

        **Solution:**

        1. Apply the **PyAV patch** (Step 2) and **`uv pip install av`**.
        2. Optionally increase **`--dataloader-num-workers`** (for example **8**) if CPUs are free.

        **Expected noise after patching:** logs may repeat `Video backend 'torchcodec' is not available, falling back to 'pyav'` — that is normal if **torchcodec** is absent.

        ## Issue: Video decoding errors / `torchcodec` not found (general)

        **Solution:**

        Prefer the **PyAV patch + `av`** path above for LIBERO on GB300.

        If you must build **torchcodec** into `.venv` manually (aarch64), with FFmpeg dev packages installed:

        ```bash
        # Run this from inside the Isaac-GR00T repo root (the directory that
        # contains .venv). Capture its absolute path BEFORE changing directories
        # so we can still reach the virtualenv after cd'ing into /tmp/torchcodec.
        GR00T_ROOT="$(pwd)"

        # Sanity check: the virtualenv interpreter must already exist. If this
        # prints a warning, stop and cd into the Isaac-GR00T root before continuing.
        test -x "$GR00T_ROOT/.venv/bin/python" || echo "Not in Isaac-GR00T root (missing .venv/bin/python)" >&2

        # Clone the torchcodec source into /tmp/torchcodec (skipped if already cloned).
        test -d /tmp/torchcodec/.git || git clone https://github.com/pytorch/torchcodec.git /tmp/torchcodec
        cd /tmp/torchcodec

        # Build torchcodec into the Isaac-GR00T virtualenv using the absolute
        # path captured above (do NOT use the relative ".venv/bin/python" here —
        # the current directory is /tmp/torchcodec, which has no .venv).
        I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 ENABLE_CUDA=1 \
          uv pip install --python "$GR00T_ROOT/.venv/bin/python" . --no-build-isolation
        ```

        CUDA-enabled builds can fail when system FFmpeg or CUDA does not match torchcodec expectations — in that case use the **PyAV patch** instead.

        ## Issue: Training loss is not decreasing

        **Solution:**

        At 2000 steps the run is still early in training. If the loss stays flat after many steps:

        1. Verify modality file: `ls examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/modality.json`
        2. Confirm **`--embodiment-tag LIBERO_PANDA`**
        3. Try **`--learning-rate 5e-4`** for faster early movement on short runs

        ## Issue: `nvidia-smi` shows the wrong GPU

        **Solution:**

        ```bash
        nvidia-smi --query-gpu=index,name --format=csv,noheader
        CUDA_VISIBLE_DEVICES=<gb300_index> python ...
        ```

        ## Issue: OpenCV or decord cannot decode LIBERO AV1

        **Notes:** **OpenCV** often fails on **AV1** in LIBERO assets. **decord** may lack a compatible wheel for your platform. The **PyAV** patch path is the supported mitigation in this playbook.


    resources:
    - name: Isaac GR00T (GitHub)
      url: https://github.com/NVIDIA/Isaac-GR00T


    - name: GR00T N1.6 Model (HuggingFace)
      url: https://huggingface.co/nvidia/GR00T-N1.6-3B


    - name: GR00T N1.6 Research Blog
      url: https://research.nvidia.com/labs/gear/gr00t-n1_6/


    - name: GR00T N1 Paper
      url: https://arxiv.org/abs/2503.14734


    - name: LIBERO Benchmark
      url: https://libero-project.github.io/main.html