mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git synced 2026-06-18 04:22:21 +00:00

Isaac GR00T N1.6 Fine-Tuning

Fine-tune and benchmark NVIDIA's GR00T N1.6 robotics foundation model on DGX Station

Overview

Basic idea

NVIDIA Isaac GR00T N1.6 is a 3-billion-parameter open vision-language-action (VLA) foundation model for generalist humanoid robot skills. It combines a Cosmos-family vision-language backbone with a 32-layer Diffusion Transformer (DiT) action head that denoises continuous robot actions from multimodal input — language instructions and camera images. The model is pre-trained on a large mixture of robot demonstration data, then adapted to specific embodiments and tasks through fine-tuning.

High-level architecture (VLM + DiT action head), as in the upstream Isaac GR00T repo:

Source: NVIDIA Isaac GR00T — media/GR00T-reference-arch-diagram.png. If the local image above is missing, the upstream copy is at https://raw.githubusercontent.com/NVIDIA/Isaac-GR00T/n1.6-release/media/GR00T-reference-arch-diagram.png.

In this playbook you will fine-tune GR00T N1.6 on the LIBERO Spatial benchmark on a DGX Station with GB300 (large unified memory). That setup supports a high global batch size (128) on a single GPU, which improves training throughput compared to typical 24–80 GB consumer or datacenter GPUs.

LIBERO Spatial (what you are fine-tuning on)

LIBERO Spatial is part of the LIBERO suite of simulated tabletop manipulation benchmarks. The spatial split emphasizes where objects need to be placed: tasks such as putting a bowl on a stove burner vs a plate, placing utensils in a mug vs next to it, or moving objects to left/right/front targets on the table. Episodes include third-person RGB video, proprioceptive state, language instructions, and continuous end-effector actions in a consistent LeRobot v2 layout. Understanding these constraints helps when you read training logs or open-loop evaluation plots.

What kind of fine-tuning this playbook uses

This playbook runs the default Isaac GR00T fine-tuning recipe from launch_finetune.py, which does not update the weights of the entire 3B model. In the stock configuration, training focuses on the action head (DiT) and the projector / adapter paths that map observations into the action model, with strong state dropout and color jitter so the policy leans on vision. Optional flags such as --tune-llm or --tune-visual (mentioned under Next steps) update more of the backbone at the cost of extra compute and memory. LoRA is not the default here; if your team uses LoRA or other PEFT variants, treat that as a separate configuration branch from this playbook.

NVIDIA DGX Station (why this hardware)

DGX Station is a deskside AI system built for large-memory GPU training and inference (this playbook targets GB300 with 284 GB HBM3e). Beyond robotics, the same class of machine supports large-model fine-tuning, RAG serving, multi-modal training, and CUDA research where single-GPU memory and bandwidth dominate. For GR00T, the headline benefit is fitting much larger batch sizes per GPU than on smaller cards, which stabilizes gradients and improves samples per second when the data pipeline keeps up.

What you'll accomplish

Check out the n1.6-release branch of Isaac GR00T so commands, embodiment tags, and demo_data/ match GR00T N1.6
Set up the environment with uv (project-local .venv) and understand what the optional install_deps.sh script changes on the system
Apply the recommended PyAV get_frames_by_indices patch when torchcodec is unavailable so LIBERO AV1 video decoding does not stall on an ffmpeg subprocess fallback
Verify the base model, fine-tune on LIBERO Spatial at batch size 128, run open-loop evaluation, and measure inference latency (with GB300 / Blackwell TorchDynamo compilation notes)

What to know before starting

Familiarity with Python virtual environments (source .venv/bin/activate)
Familiarity with PyTorch training concepts (batch size, loss, checkpoints)
Basic robot manipulation vocabulary (trajectories, observations, actions)
Comfort running commands that may use sudo for system packages (or use the documented user-space alternative)

Prerequisites

NVIDIA DGX Station with GB300 (Blackwell SM103, 284 GB HBM3e)
CUDA toolkit usable by PyTorch: nvcc --version should show CUDA 12.8+ (often already under /usr/local/cuda on DGX images)
Git and Git LFS (git lfs version) — LFS is required for some demo assets and submodules; install with sudo apt-get install -y git-lfs then git lfs install if missing
Hugging Face account and HF_TOKEN for model and dataset downloads
Network access to Hugging Face, GitHub, and PyPI
At least ~30 GB free disk for .venv, checkpoints, and the LIBERO download

Time & risk

Duration: ~45 minutes end-to-end when the video backend is healthy (setup, downloads, ~20–25 min training at 2000 steps, eval and inference)
Risks: scripts/deployment/dgpu/install_deps.sh performs system-level apt operations and may install the CUDA 12.8 toolkit if /usr/local/cuda is absent (see Instructions). Model download requires Hugging Face authentication.
Rollback: Remove the cloned Isaac-GR00T directory and optionally rm -rf ~/.local/share/uv if you want to reclaim uv caches. Reverting apt-installed packages is a separate admin task; the playbook does not uninstall them automatically.
Last Updated: 05/26/2026
- First Publication

Instructions

Step 1. Clone Isaac GR00T and install dependencies

1a. Git LFS (required for a clean clone)

If git clone fails with errors about Git LFS or missing pointer files, install and initialize LFS, then remove any partial Isaac-GR00T directory and clone again:

sudo apt-get update
sudo apt-get install -y git-lfs
git lfs install

1b. Clone and check out `n1.6-release`

The main branch tracks ongoing development (for example newer GR00T milestones) and does not always match this N1.6 playbook. Embodiment tags such as GR1, paths like demo_data/gr1.PickNPlace, and tutorial scripts are aligned with the n1.6-release branch.

git clone --recurse-submodules https://github.com/NVIDIA/Isaac-GR00T
cd Isaac-GR00T
git fetch origin
git checkout n1.6-release
git submodule update --init --recursive

1c. Install Python dependencies

Option A — `install_deps.sh` (matches upstream docs; uses `sudo`)

This script is the supported path. It may make system-level changes:

Runs apt-get update and installs ffmpeg and libaio-dev
If /usr/local/cuda is missing, adds the NVIDIA CUDA apt repository and installs cuda-toolkit-12-8
Installs uv into your user account if needed, then runs uv sync and uv pip install -e . into the project .venv
On aarch64 only: installs FFmpeg development packages and builds torchcodec from source into .venv

I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 bash scripts/deployment/dgpu/install_deps.sh

Option B — User-space only (no `install_deps.sh`)

Use this only when CUDA 12.8+ is already installed, system ffmpeg / libaio-dev are already present, and your policy forbids the script's apt or CUDA steps. From the Isaac-GR00T repo root, install uv if needed, then:

command -v uv >/dev/null || curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="/usr/local/cuda/bin:$HOME/.local/bin:$PATH"
export CUDA_HOME=/usr/local/cuda
uv sync
uv pip install -e .

You still need a working video backend for LIBERO (see Step 2). On aarch64, building torchcodec inside .venv without the script is possible but manual; see Troubleshooting.

Important

PATH and CUDA_HOME matter on multi-toolkit hosts. If the system has both an old Ubuntu nvidia-cuda-toolkit package (/usr/bin/nvcc ≈ 12.0) and a current NVIDIA CUDA repo install (/usr/local/cuda-13.x/bin/nvcc), uv will pick whichever appears first on PATH. Putting /usr/local/cuda/bin first (and exporting CUDA_HOME) is required for flash-attn's source build to find the matching toolkit. Verify with nvcc --version after the export.
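To cross-check which toolkit wins on PATH, a minimal Python sketch along these lines can help. The helper names (parse_nvcc_release, first_nvcc) are illustrative and not part of Isaac GR00T, and the parser assumes the usual "release X.Y" text in nvcc --version output.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple


def parse_nvcc_release(version_text: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from `nvcc --version` output, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", version_text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def first_nvcc() -> Optional[Tuple[str, Optional[Tuple[int, int]]]]:
    """Return the nvcc found first on PATH together with its parsed release."""
    path = shutil.which("nvcc")
    if path is None:
        return None
    out = subprocess.run([path, "--version"], capture_output=True, text=True).stdout
    return path, parse_nvcc_release(out)


if __name__ == "__main__":
    found = first_nvcc()
    if found is None:
        print("nvcc not on PATH; check the PATH export")
    else:
        path, release = found
        ok = release is not None and release >= (12, 8)
        print(f"{path}: release {release} -> {'OK' if ok else 'older than 12.8'}")
```

Run it after the export; if the reported path is /usr/bin/nvcc rather than one under /usr/local/cuda, the older Ubuntu package is still first on PATH.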

Warning

flash-attn build on aarch64 takes ~2 hours from source. The upstream pyproject.toml only lists pre-built flash-attn==2.7.4.post1 wheels for x86_64; on aarch64 (Grace + GB300), uv sync falls back to compiling ~72 CUDA kernels from source. A faster route is to pin flash-attn==2.8.1 and reuse the GitHub release's prebuilt aarch64 wheel:
# In pyproject.toml under [project] dependencies:
"flash-attn==2.8.1",

# In [tool.uv.sources]:
flash-attn = [
    { url = "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.1/flash_attn-2.8.1+cu12torch2.10cxx11abiTRUE-cp312-cp312-linux_aarch64.whl",
      marker = "sys_platform == 'linux' and platform_machine == 'aarch64' and python_version == '3.12'" },
]
With this pin, uv sync finishes in ~1 minute on aarch64 instead of ~2 hours. The wheel works against torch 2.10. Verified on GB300 + CUDA 13.1 in this playbook's validation run.
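To see whether the marker in that source entry would select the prebuilt wheel on a given host, the three clauses can be mirrored in a few lines of Python. This is only a sketch of the marker logic (the function name is hypothetical); uv evaluates the real marker itself.

```python
import platform
import sys


def aarch64_wheel_applies(sys_platform: str, machine: str, python_version: str) -> bool:
    """Mirror the marker: linux + aarch64 + Python 3.12."""
    return sys_platform == "linux" and machine == "aarch64" and python_version == "3.12"


if __name__ == "__main__":
    py = f"{sys.version_info.major}.{sys.version_info.minor}"
    applies = aarch64_wheel_applies(sys.platform, platform.machine(), py)
    print(f"prebuilt aarch64 wheel selected on this host: {applies}")
```

If this prints False on the Station, the likely cause is a .venv on a Python version other than 3.12, in which case the pin does not take effect and the source build runs instead.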

Activate the virtual environment:

source .venv/bin/activate

Verify GPU access:

CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_name(0))"

Expected output: NVIDIA GB300

Note

Examples in this playbook use CUDA_VISIBLE_DEVICES=0 because the GB300 is at index 0 on a single-GPU Station. On a multi-GPU Station (for example RTX PRO 6000 + GB300), the GB300 may be at a different index — run nvidia-smi --query-gpu=index,name --format=csv,noheader, find the GB300 row, and substitute that index everywhere CUDA_VISIBLE_DEVICES=0 appears below.
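If you would rather script the lookup, the sketch below parses text in the "index, name" shape that the nvidia-smi query prints. The function name and the sample GPU names are illustrative assumptions.

```python
from typing import Optional


def find_gpu_index(csv_text: str, name_fragment: str = "GB300") -> Optional[int]:
    """Return the index of the first GPU whose name contains name_fragment."""
    for line in csv_text.strip().splitlines():
        index, _, name = line.partition(",")
        if name_fragment in name:
            return int(index.strip())
    return None


if __name__ == "__main__":
    # Sample text in the "index, name" shape; the GPU names are illustrative.
    sample = "0, NVIDIA RTX PRO 6000\n1, NVIDIA GB300\n"
    print(f"CUDA_VISIBLE_DEVICES={find_gpu_index(sample)}")
```

In practice you would pass the captured output of the nvidia-smi command instead of the sample string.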

Step 2. PyAV patch for LIBERO video (strongly recommended)

On many stacks torchcodec fails to import or build, so the backend resolver moves down its existing fallback order (torchcodec → decord → pyav → ffmpeg). On stock n1.6-release, get_frames_by_indices has no pyav branch and can raise NotImplementedError. If decoding instead ends up on the ffmpeg subprocess fallback, training may appear hung: the GPU sits idle with no traceback while ffmpeg decodes frames one at a time on the CPU. The patch below adds the missing pyav path.

From the Isaac-GR00T repo root with n1.6-release checked out and .venv activated:

git apply /path/to/dgx-station-playbooks/nvidia/station-gr00t/assets/patches/001-pyav-get-frames-by-indices.patch
uv pip install av

If you copied nvidia/station-gr00t/assets/patches/ into the Isaac-GR00T root instead, use git apply assets/patches/001-pyav-get-frames-by-indices.patch.

Details and re-apply rules: nvidia/station-gr00t/assets/patches/README.md.

After patching, repeated log lines such as Video backend 'torchcodec' is not available, falling back to 'pyav' are expected and noisy but not fatal.
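To predict which backend a given .venv is likely to land on, the fallback order can be walked by hand. The sketch below is not the repo's resolver: the module names (torchcodec, decord, av) are assumptions, and it only checks that a package is installed, not that it imports cleanly.

```python
import importlib.util
import shutil
from typing import Callable, Optional

# Fallback order quoted in this step; the module names are assumptions.
BACKEND_MODULES = [("torchcodec", "torchcodec"), ("decord", "decord"), ("pyav", "av")]


def module_available(module_name: str) -> bool:
    """True if the module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


def first_available_backend(
    is_available: Callable[[str], bool] = module_available,
    has_ffmpeg: Optional[bool] = None,
) -> Optional[str]:
    """Walk torchcodec -> decord -> pyav -> ffmpeg and return the first hit."""
    for backend, module_name in BACKEND_MODULES:
        if is_available(module_name):
            return backend
    if has_ffmpeg is None:
        has_ffmpeg = shutil.which("ffmpeg") is not None
    return "ffmpeg" if has_ffmpeg else None


if __name__ == "__main__":
    print(f"first available video backend: {first_available_backend()}")
```

A result of "ffmpeg" with the venv activated suggests that uv pip install av did not take effect.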

Step 3. Set up Hugging Face authentication

export HF_TOKEN="your_huggingface_token"

Get a token from https://huggingface.co/settings/tokens if you don't have one.

Step 4. Download the dataset and model

Download the LIBERO Spatial dataset and the GR00T N1.6 base model:

## Download LIBERO Spatial dataset (~2-3 GB)
huggingface-cli download \
    --repo-type dataset IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot \
    --local-dir examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/

## Copy the LIBERO modality config into the dataset's meta/ directory
cp examples/LIBERO/modality.json \
    examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/

## Download GR00T N1.6 base model (~6 GB)
huggingface-cli download nvidia/GR00T-N1.6-3B

Note

HF cache permission errors: If huggingface-cli download fails with Permission denied: '/home/.../.cache/huggingface/hub/...', the cache directory was previously created by a Docker container running as root (common on shared dev boxes). Point HF at a user-owned cache for this run:
export HF_HOME=$HOME/hf_cache_gr00t
Transient xet-read-token 500 errors: Hugging Face's xet backend occasionally returns 500 Internal Server Error for dataset downloads. Disable it:
export HF_HUB_DISABLE_XET=1

Verify the dataset is ready:

ls examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/modality.json

Expected result: the command prints the path to modality.json and ls exits 0. That confirms the copied modality file sits next to the downloaded LeRobot dataset metadata.
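For a slightly stronger check than ls, you can also confirm that the file parses as JSON. The following is a minimal sketch; the helper name is hypothetical, and it makes no assumption about which keys the file contains beyond it being a JSON object.

```python
import json
from pathlib import Path


def modality_sections(dataset_dir: str) -> list:
    """Load meta/modality.json under dataset_dir and return its top-level keys."""
    path = Path(dataset_dir) / "meta" / "modality.json"
    with path.open(encoding="utf-8") as handle:
        config = json.load(handle)
    if not isinstance(config, dict):
        raise ValueError(f"{path} should contain a JSON object")
    return sorted(config)


if __name__ == "__main__":
    dataset = "examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot"
    try:
        print(modality_sections(dataset))
    except FileNotFoundError:
        print("modality.json not found; re-run the cp command from Step 4")
```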

Step 5. Verify the base model loads and runs

Confirm the GR00T N1.6 base model loads and produces actions using the GR1 demo shipped on n1.6-release:

TORCHDYNAMO_DISABLE=1 CUDA_VISIBLE_DEVICES=0 python scripts/deployment/standalone_inference_script.py \
    --model-path nvidia/GR00T-N1.6-3B \
    --dataset-path demo_data/gr1.PickNPlace \
    --embodiment-tag GR1 \
    --traj-ids 0 \
    --inference-mode pytorch \
    --action-horizon 8 \
    --steps 32

TORCHDYNAMO_DISABLE=1 avoids torch.compile / Triton paths that can fail on GB300 with ptxas-blackwell fatal: Value 'sm_103a' is not defined for option 'gpu-name'. Keep it on all standalone_inference_script.py invocations in this playbook unless you have a Triton build that supports SM103.
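If you launch these commands from a Python wrapper rather than a shell, the same two variables can be set on a copied environment. This is a sketch under that assumption; eager_env is a hypothetical helper, not something the repo provides.

```python
import os
from typing import Dict, Mapping, Optional


def eager_env(base: Optional[Mapping[str, str]] = None, gpu_index: int = 0) -> Dict[str, str]:
    """Copy an environment and add the two variables used in this playbook."""
    env = dict(os.environ if base is None else base)
    env["TORCHDYNAMO_DISABLE"] = "1"  # skip torch.compile / Triton paths
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)  # pin the run to one GPU
    return env


if __name__ == "__main__":
    env = eager_env()
    print({key: env[key] for key in ("TORCHDYNAMO_DISABLE", "CUDA_VISIBLE_DEVICES")})
```

Pass the result as env= to subprocess.run when starting standalone_inference_script.py. The variables must be in place before the child process imports torch.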

You should see per-step timing output and no errors. This confirms the model, CUDA, and data pipeline work before a long fine-tuning run.

Note

The base model's pretrained processor does not include the LIBERO_PANDA embodiment configuration, so you cannot run this standalone script on the LIBERO dataset with the base checkpoint alone. The LIBERO modality config is registered during fine-tuning. That is expected — LIBERO is a post-training benchmark.

Step 6. Fine-tune GR00T N1.6 on LIBERO Spatial

Fine-tune the base model on LIBERO Spatial. DGX Station's GB300 GPU with 284 GB HBM3e allows a global batch size of 128 on a single GPU, several times what fits on a typical 80 GB GPU. Larger batches stabilize gradients and improve wall-clock throughput when the dataloader keeps the GPU fed.

CUDA_VISIBLE_DEVICES=0 python \
    gr00t/experiment/launch_finetune.py \
    --base-model-path nvidia/GR00T-N1.6-3B \
    --dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
    --embodiment-tag LIBERO_PANDA \
    --num-gpus 1 \
    --output-dir output/libero_spatial_ft \
    --save-steps 500 \
    --save-total-limit 5 \
    --max-steps 2000 \
    --global-batch-size 128 \
    --learning-rate 1e-4 \
    --warmup-ratio 0.05 \
    --weight-decay 1e-5 \
    --state-dropout-prob 0.8 \
    --color-jitter-params brightness 0.3 contrast 0.4 saturation 0.5 hue 0.08 \
    --dataloader-num-workers 4

If GPU utilization stays near zero for many minutes while the process is alive, suspect video decoding (see Step 2 patch and Troubleshooting). You can try --dataloader-num-workers 8 if CPU cores are available.

Training runs for 2000 steps at batch size 128 and takes approximately 20–25 minutes on GB300 when torchcodec is the active video backend.

Important

With the PyAV fallback (Step 2 patch + no torchcodec), expect ~5–6 s per step instead of <1 s, so 2000 steps takes roughly 3 hours (2000 × 5–6 s ≈ 2.8–3.3 hours), and GPU utilization sits in the 3–30 % range while CPU-side video decoding starves the GPU. To validate the workflow without the long wait, lower --max-steps (e.g. 100) and --save-steps (e.g. 50); loss should still drop visibly (validated drop 1.07 → 0.63 in 100 steps in this playbook's GB300 run). If you need full-throughput training, build torchcodec from source (Troubleshooting → "Video decoding errors") or run Option A, which builds it for you.
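The wall-clock estimate is plain arithmetic on steps and seconds per step. The sketch below makes it easy to redo for other step counts. The step times are the rough figures quoted in this playbook, not new measurements.

```python
def training_hours(max_steps: int, seconds_per_step: float) -> float:
    """Wall-clock hours for max_steps at a steady seconds_per_step."""
    return max_steps * seconds_per_step / 3600.0


if __name__ == "__main__":
    # Step times are the rough figures quoted in this playbook.
    for label, sec in [("PyAV fallback, low", 5.0), ("PyAV fallback, high", 6.0)]:
        print(f"{label}: 2000 steps ~ {training_hours(2000, sec):.1f} h")
    print(f"100-step validation at 6 s/step ~ {training_hours(100, 6.0) * 60:.0f} min")
```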

Note

This playbook uses 2000 steps to keep execution time under an hour. For production-quality results closer to the published 97.65% success rate on LIBERO Spatial, increase to 20,000 steps (--max-steps 20000). Published settings used batch size 640 across 8 GPUs — 128 on one GB300 exceeds the per-GPU batch in that reference.

What the training flags mean:

| Flag | Value | Purpose |
|---|---|---|
| `--global-batch-size` | 128 | Total samples per training step; enabled by GB300 memory. |
| `--state-dropout-prob` | 0.8 | Drops proprioceptive state 80% of the time so the model relies on vision. |
| `--color-jitter-params` | brightness/contrast/saturation/hue | Photometric augmentation for lighting robustness. |
| `--warmup-ratio` | 0.05 | Linear LR warmup over the first 5% of steps. |
| `--save-steps` | 500 | Checkpoint cadence under `output/libero_spatial_ft/`. |
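To make the warmup and checkpoint values concrete, the sketch below works out what 0.05 and 500 mean for a 2000-step run. It models only the linear warmup ramp. The schedule after warmup is not modeled and is held constant here as a simplifying assumption, and the function names are illustrative.

```python
def warmup_steps(max_steps: int, warmup_ratio: float) -> int:
    """Number of warmup steps implied by a warmup ratio."""
    return round(max_steps * warmup_ratio)


def warmup_lr(step: int, base_lr: float, n_warmup: int) -> float:
    """Linear ramp from 0 to base_lr; held constant afterwards in this sketch."""
    if n_warmup <= 0 or step >= n_warmup:
        return base_lr
    return base_lr * step / n_warmup


def checkpoint_steps(max_steps: int, save_steps: int) -> list:
    """Steps that are multiples of save_steps up to max_steps."""
    return list(range(save_steps, max_steps + 1, save_steps))


if __name__ == "__main__":
    n = warmup_steps(2000, 0.05)
    print(f"warmup steps: {n}")
    print(f"lr at step 50: {warmup_lr(50, 1e-4, n):.1e}")
    print(f"checkpoints: {checkpoint_steps(2000, 500)}")
```

With these flags the run writes four checkpoints, which stays under --save-total-limit 5, so none of them is pruned.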

Monitor the Hugging Face Trainer loss in the terminal. Checkpoints land under output/libero_spatial_ft/.

Step 7. Evaluate the fine-tuned model

Open-loop evaluation compares predicted actions to dataset ground truth and writes plots to /tmp/open_loop_eval/:

CUDA_VISIBLE_DEVICES=0 python gr00t/eval/open_loop_eval.py \
    --dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
    --embodiment-tag LIBERO_PANDA \
    --model-path output/libero_spatial_ft/checkpoint-2000/ \
    --traj-ids 0 1 2 \
    --action-horizon 16

How to read the run: the terminal prints per-trajectory MSE/MAE and averages. The JPEGs under /tmp/open_loop_eval/ overlay predicted vs ground-truth trajectories per action dimension (translation, rotation, gripper). Use them to confirm the policy tracks pick-and-place phases and gripper open/close timing on spatial tasks.
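As a reminder of what the two numbers measure, here is a minimal pure-Python sketch of MSE and MAE over a trajectory of action vectors. The exact averaging inside open_loop_eval.py may differ; the function name and toy values are assumptions for illustration.

```python
from typing import Sequence, Tuple


def mse_mae(
    pred: Sequence[Sequence[float]], truth: Sequence[Sequence[float]]
) -> Tuple[float, float]:
    """Mean squared and mean absolute error over all timesteps and action dims."""
    errors = [p - t for p_row, t_row in zip(pred, truth) for p, t in zip(p_row, t_row)]
    if not errors:
        raise ValueError("empty trajectories")
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return mse, mae


if __name__ == "__main__":
    # Toy two-step, two-dimension trajectory; the values are made up.
    predicted = [[0.0, 1.0], [0.5, 1.0]]
    ground_truth = [[0.0, 0.0], [0.5, 2.0]]
    print(mse_mae(predicted, ground_truth))
```

MSE penalizes large deviations more heavily, so a plot with one bad spike can show a high MSE alongside a modest MAE.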

Tip

At 2000 steps you should see clear improvement over a random policy; at 20,000 steps, published LIBERO Spatial success reaches 97.65% in closed-loop sim.

Step 8. Run inference on a LIBERO sample (timing + actions)

This step passes LIBERO Spatial observations through the fine-tuned checkpoint (the base model cannot run this embodiment). TORCHDYNAMO_DISABLE=1 is included for GB300:

TORCHDYNAMO_DISABLE=1 CUDA_VISIBLE_DEVICES=0 python scripts/deployment/standalone_inference_script.py \
    --model-path output/libero_spatial_ft/checkpoint-2000/ \
    --dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
    --embodiment-tag LIBERO_PANDA \
    --traj-ids 0 \
    --inference-mode pytorch \
    --action-horizon 8

What to inspect: the script prints a timing breakdown (data processing, backbone, action head, end-to-end). Compare MSE/MAE and latency to Step 5's base-model smoke test. In eager mode (with TORCHDYNAMO_DISABLE=1), per-step latency on GB300 depends heavily on the torch + CUDA stack — expect ~3–4 s/step on torch 2.10 + cu130 in eager mode (validated in this playbook's run on a fine-tuned checkpoint-100); a compiled torch 2.7 + cu128 stack with Triton support for sm_103 can be much faster. Treat the "Backbone vs Action head" split as the more stable signal across stacks.
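To compare the split across stacks, it can help to turn the printed per-stage times into fractions. This is a small sketch; the stage names and numbers are placeholders, not output from the script.

```python
from typing import Dict


def stage_shares(timings_s: Dict[str, float]) -> Dict[str, float]:
    """Convert per-stage seconds into fractions of their total."""
    total = sum(timings_s.values())
    if total <= 0:
        raise ValueError("timings must sum to a positive value")
    return {stage: seconds / total for stage, seconds in timings_s.items()}


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not measured on GB300.
    example = {"backbone": 3.0, "action_head": 1.0}
    for stage, share in stage_shares(example).items():
        print(f"{stage}: {share:.0%}")
```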

Step 9. Clean up

deactivate
cd ..
rm -rf Isaac-GR00T

Fine-tuned checkpoints under output/libero_spatial_ft/ are removed with the repo. Copy them elsewhere first if you want to keep them.

Next steps

Increase training steps — --max-steps 20000 for stronger LIBERO Spatial alignment (~3.5 hours at the same throughput).
Other LIBERO suites — libero_10_no_noops, libero_goal_no_noops, libero_object_no_noops from IPEC-COMMUNITY on Hugging Face.
Closed-loop sim — LIBERO sim server/client: LIBERO evaluation in Isaac GR00T.
Custom embodiments — Fine-tune a new embodiment (LeRobot v2 + modality JSON).
Tune more of the stack — --tune-llm / --tune-visual raise memory use; probe batch size if you enable them.

Troubleshooting

Common Issues

Issue: `git clone` fails or demo videos are tiny / missing (Git LFS)

Solution:

sudo apt-get install -y git-lfs
git lfs install

Remove any partial Isaac-GR00T directory, then clone again with --recurse-submodules.

Issue: `GR1`, `demo_data/gr1.PickNPlace`, or scripts do not match the playbook

Cause: The repository default branch (main) may track a newer GR00T line (for example N1.7) with different embodiment tags and demo layouts.

Solution:

cd Isaac-GR00T
git fetch origin
git checkout n1.6-release
git submodule update --init --recursive

Always run playbook commands from n1.6-release for N1.6 + GR00T-N1.6-3B.

Issue: `install_deps.sh` is not allowed on your machine (policy) or you need to know what it changes

Facts: scripts/deployment/dgpu/install_deps.sh runs sudo apt-get to install ffmpeg, libaio-dev, and (on aarch64) FFmpeg development libraries for the torchcodec build. If /usr/local/cuda does not exist, it adds the NVIDIA CUDA apt repo and installs cuda-toolkit-12-8. It also installs uv into the user account if missing, then uv sync + uv pip install -e . into .venv.

Solution (policy-friendly): Pre-install the same system packages and CUDA using your IT process, ensure nvcc works, then from the repo root:

export PATH="$HOME/.local/bin:$PATH"
uv sync
uv pip install -e .

On aarch64, you still need torchcodec in .venv or rely on the PyAV patch (Instructions Step 2) plus uv pip install av.

Issue: `uv sync` (Option B) appears stuck for hours building `flash-attn` on aarch64

Cause: Upstream pyproject.toml lists pre-built flash-attn==2.7.4.post1 wheels only for linux_x86_64. On aarch64 (Grace + GB300), uv falls back to a from-source build that compiles ~72 CUDA kernels — typically ~2 hours end-to-end.

Solution: Pin to flash-attn==2.8.1 and use the GitHub release's prebuilt aarch64 wheel. Edit pyproject.toml in the repo root:

## under [project] dependencies, replace:
## "flash-attn==2.7.4.post1",
"flash-attn==2.8.1",

## under [tool.uv.sources], add:
flash-attn = [
    { url = "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.1/flash_attn-2.8.1+cu12torch2.10cxx11abiTRUE-cp312-cp312-linux_aarch64.whl",
      marker = "sys_platform == 'linux' and platform_machine == 'aarch64' and python_version == '3.12'" },
]

The cu12torch2.10 aarch64 wheel works against torch 2.10 (cu128 or cu130 builds). Validated on GB300 + CUDA 13.1 — uv sync completes in ~1 minute instead of ~2 hours. Track upstream Isaac-GR00T for a future commit that bakes this in.

If you must keep flash-attn==2.7.4.post1 (Option A path), expect the 2-hour build on first sync; subsequent uv sync invocations re-use the cached wheel.

Issue: `install_deps.sh` fails building torchcodec

Solution:

Ensure the license confirmation env var is set:

I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 bash scripts/deployment/dgpu/install_deps.sh

If the build still fails, install FFmpeg development libraries:

sudo apt-get install -y libavdevice-dev libavfilter-dev libavformat-dev \
    libavcodec-dev libavutil-dev libswresample-dev libswscale-dev \
    pkg-config cmake build-essential pybind11-dev

Then apply Instructions Step 2 (PyAV patch) so training does not depend on a working torchcodec for indexed frame reads.

Issue: `huggingface-cli download` fails with 401 Unauthorized

Solution:

echo $HF_TOKEN
huggingface-cli whoami

If the token is not set:

export HF_TOKEN="your_token_here"

Accept any required license or gated-model agreements on the Hugging Face model page.

Issue: `huggingface-cli download` fails with `Permission denied: '/home/.../.cache/huggingface/hub/...'`

Cause: The shared cache directory was previously created by a Docker container running as root (common on multi-user dev boxes that mount ~/.cache/huggingface into containers without --user). The current user (nvidia) cannot write into it.

Solution: point HF at a user-owned cache location for this run:

export HF_HOME=$HOME/hf_cache_gr00t
mkdir -p "$HF_HOME"
huggingface-cli download nvidia/GR00T-N1.6-3B

Re-export HF_HOME for the rest of the playbook (Step 5 onward) so model loads find the right cache. To permanently un-stick the original cache, ask whoever owns the container session to chown ~/.cache/huggingface back to your user.

Issue: `huggingface-cli download` returns `500 Internal Server Error` from the `xet-read-token` endpoint

Cause: Hugging Face's xet content-addressable backend occasionally returns transient 5xx. This blocks dataset downloads even though the underlying files are reachable via the legacy backend.

Solution: disable xet for the download:

export HF_HUB_DISABLE_XET=1
huggingface-cli download --repo-type dataset \
    IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot \
    --local-dir examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/

Issue: `externally-managed-environment` or `pip` installs not going into `.venv`

Cause: Debian/Ubuntu PEP 668 blocks pip install onto the system Python. Mixing sudo pip with the project venv breaks the playbook.

Solution:

source .venv/bin/activate — prompt should show (.venv).
Use uv pip install ... (or python -m pip install ...) only with the venv activated — never sudo pip for this project.
If the venv was created with a broken pip, recreate: rm -rf .venv and run uv sync again from the repo root (after n1.6-release checkout).
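A quick way to confirm that the interpreter you are about to install into really is the project venv is to compare sys.prefix with sys.base_prefix. The sketch below is a generic check, and the helper name is illustrative.

```python
import sys


def in_virtualenv(prefix: str = sys.prefix, base_prefix: str = sys.base_prefix) -> bool:
    """A venv is active when the interpreter prefix differs from the base prefix."""
    return prefix != base_prefix


if __name__ == "__main__":
    if in_virtualenv():
        print(f"virtual environment active: {sys.prefix}")
    else:
        print("system interpreter; run `source .venv/bin/activate` first")
```

Run it with the same python that your shell resolves. If it reports the system interpreter, any pip install will hit the PEP 668 guard.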

Issue: CUDA out of memory during fine-tuning

Solution:

Reduce batch size:

--global-batch-size 64

Check for other GPU processes: nvidia-smi. --tune-llm / --tune-visual increase memory use substantially.

Issue: Triton / PTXAS errors about `sm_103a` (GB300 / Blackwell)

Symptom:

ptxas-blackwell fatal: Value 'sm_103a' is not defined for option 'gpu-name'

Solution:

For scripts/deployment/standalone_inference_script.py (which may use torch.compile), prepend:

TORCHDYNAMO_DISABLE=1 python scripts/deployment/standalone_inference_script.py ...

This forces eager inference (higher latency per step but stable on SM103 until Triton catches up). Fine-tuning and open_loop_eval.py typically run without this compile path; use the same prefix there only if you see the same crash.

Issue: `ModuleNotFoundError: No module named 'gr00t'`

Solution:

source .venv/bin/activate
pwd   # should print a path ending in /Isaac-GR00T (the repo root)

Issue: `NotImplementedError` in `get_frames_by_indices` when backend is `pyav`

Cause: On n1.6-release, resolve_backend can select pyav, but stock get_frames_by_indices did not implement the pyav branch.

Solution: Apply the playbook patch and install PyAV (see Instructions Step 2 and assets/patches/README.md).

Issue: Training “hangs” — low GPU utilization, no traceback, very slow steps

Cause: Fallback to per-frame ffmpeg subprocess decoding for AV1 LIBERO clips; dataloaders starve the GPU.

Solution:

Apply the PyAV patch (Step 2) and uv pip install av.
Optionally increase --dataloader-num-workers (for example 8) if CPUs are free.

Expected noise after patching: logs may repeat Video backend 'torchcodec' is not available, falling back to 'pyav' — that is normal if torchcodec is absent.

Issue: Video decoding errors / `torchcodec` not found (general)

Solution:

Prefer the PyAV patch + av path above for LIBERO on GB300.

If you must build torchcodec into .venv manually (aarch64), with FFmpeg dev packages installed:

## Run this from inside the Isaac-GR00T repo root (the directory that
## contains .venv). Capture its absolute path BEFORE changing directories
## so we can still reach the virtualenv after cd'ing into /tmp/torchcodec.
GR00T_ROOT="$(pwd)"

## Sanity check — the virtualenv interpreter must already exist.
test -x "$GR00T_ROOT/.venv/bin/python" || { echo "Not in Isaac-GR00T root (missing .venv/bin/python)"; }

## Clone the torchcodec source into /tmp/torchcodec (skipped if a clone is already there).
[ -d /tmp/torchcodec/.git ] || git clone https://github.com/pytorch/torchcodec.git /tmp/torchcodec
cd /tmp/torchcodec

## Build torchcodec into the Isaac-GR00T virtualenv using the absolute
## path captured above (do NOT use the relative ".venv/bin/python" here —
## the current directory is /tmp/torchcodec, which has no .venv).
I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 ENABLE_CUDA=1 \
  uv pip install --python "$GR00T_ROOT/.venv/bin/python" . --no-build-isolation

CUDA-enabled builds can fail when system FFmpeg or CUDA does not match torchcodec expectations — in that case use the PyAV patch instead.

Issue: Training loss is not decreasing

Solution:

At 2000 steps training may still be at an early stage. If loss is flat after many steps:

Verify modality file: ls examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/modality.json
Confirm --embodiment-tag LIBERO_PANDA
Try --learning-rate 5e-4 for faster early movement on short runs

Issue: `nvidia-smi` shows the wrong GPU

Solution:

nvidia-smi --query-gpu=index,name --format=csv,noheader
CUDA_VISIBLE_DEVICES=<gb300_index> python ...

Issue: OpenCV or decord cannot decode LIBERO AV1

Notes: OpenCV often fails on AV1 in LIBERO assets. decord may lack a compatible wheel for your platform. The PyAV patch path is the supported mitigation in this playbook.
