# Isaac GR00T N1.6 Fine-Tuning

> Fine-tune and benchmark NVIDIA's GR00T N1.6 robotics foundation model on DGX Station


## Table of Contents

- [Overview](#overview)
- [Instructions](#instructions)
- [Troubleshooting](#troubleshooting)
  - [Issue: `install_deps.sh` fails building torchcodec](#issue-install_depssh-fails-building-torchcodec)
  - [Issue: `huggingface-cli download` fails with 401 Unauthorized](#issue-huggingface-cli-download-fails-with-401-unauthorized)
  - [Issue: CUDA out of memory during fine-tuning](#issue-cuda-out-of-memory-during-fine-tuning)
  - [Issue: Triton/PTXAS errors about `sm_103a` during inference](#issue-tritonptxas-errors-about-sm_103a-during-inference)
  - [Issue: `ModuleNotFoundError: No module named 'gr00t'`](#issue-modulenotfounderror-no-module-named-gr00t)
  - [Issue: Training loss is not decreasing](#issue-training-loss-is-not-decreasing)
  - [Issue: `nvidia-smi` shows the wrong GPU](#issue-nvidia-smi-shows-the-wrong-gpu)
  - [Issue: Slow data loading during training](#issue-slow-data-loading-during-training)
  - [Issue: Video decoding errors (`NotImplementedError` or torchcodec not found)](#issue-video-decoding-errors-notimplementederror-or-torchcodec-not-found)

---

## Overview

## Basic idea

NVIDIA Isaac GR00T N1.6 is a 3-billion-parameter open vision-language-action (VLA) foundation model for generalist humanoid robot skills. It combines a Cosmos-Reason-2B vision-language backbone with a 32-layer Diffusion Transformer (DiT) action head that denoises continuous robot actions from multimodal input — language instructions and camera images. The model is pre-trained on over 10,000 hours of robot demonstration data spanning bimanual arms, semi-humanoid platforms, and full humanoids, then adapted to specific embodiments and tasks through fine-tuning.

In this playbook you will fine-tune GR00T N1.6 on the LIBERO Spatial benchmark, a manipulation task suite that tests spatial reasoning with a Panda robot arm. DGX Station's GB300 GPU has 284 GB of HBM3e memory, enough for a per-device batch size of **128**. That is well above the 32–64 typically used on smaller GPUs, and the larger batch reduces gradient noise and improves training throughput.

## What you'll accomplish

- Set up the Isaac GR00T environment using `uv` (fast, reproducible Python packaging)
- Verify the pre-trained base model loads and runs inference
- Fine-tune GR00T N1.6 on the LIBERO Spatial dataset with batch size 128
- Evaluate the fine-tuned model using open-loop evaluation and measure inference latency

## What to know before starting

- Familiarity with Python virtual environments
- Familiarity with PyTorch training workflows (epochs, batch size, loss curves)
- General understanding of robot manipulation concepts (actions, observations, trajectories)

## Prerequisites

- NVIDIA DGX Station with GB300 GPU (Blackwell SM103, 284 GB HBM3e)
- CUDA toolkit installed: `nvcc --version` should show CUDA 12.8+
- Git installed: `git --version`
- HuggingFace account with access token (for model and dataset downloads)
- Network access to HuggingFace, GitHub, and PyPI
- At least 30 GB of free disk space (venv + model + dataset)
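
The disk-space and tool prerequisites above can be checked in one pass. The following is a minimal sketch using only the Python standard library; the function name `preflight` and the choice of tools to probe are illustrative assumptions, not part of the Isaac GR00T repository.

```python
import shutil

def preflight(path=".", min_free_gb=30, tools=("git", "nvcc")):
    """Report free disk space at `path` and whether each CLI tool is on PATH."""
    free_gb = shutil.disk_usage(path).free / 1e9
    report = {"free_gb": round(free_gb, 1), "disk_ok": free_gb >= min_free_gb}
    for tool in tools:
        # shutil.which returns None when the executable is not found on PATH
        report[tool] = shutil.which(tool) is not None
    return report

for key, value in preflight().items():
    print(f"{key}: {value}")
```

A `False` for `nvcc` usually means the CUDA toolkit is not on `PATH`; the sketch does not check the CUDA version itself.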

## Time & risk

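* **Duration:** ~45 minutes (about 5 min setup, 5 min downloads, 20–25 min fine-tuning at 2000 steps, and 5 min evaluation, plus a few minutes for base-model verification and the inference benchmark)
* **Risks:** Model download requires HuggingFace authentication; the install script installs system packages (`ffmpeg`, `libaio-dev`) and `uv` if missing; `uv sync` installs Python packages into a project-local `.venv`
* **Rollback:** Delete the cloned `Isaac-GR00T` directory to remove the virtual environment, dataset, and checkpoints. System packages and `uv` installed by the script, and the base model in the HuggingFace cache (`~/.cache/huggingface` by default), remain until you remove them separately.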
* **Last Updated:** 04/06/2026
  * First Publication

## Instructions

## Step 1. Clone Isaac GR00T and install dependencies

Clone the repository and run the dGPU install script. This uses `uv` for fast, reproducible dependency management and automatically detects the aarch64 architecture:

```bash
git clone --recurse-submodules https://github.com/NVIDIA/Isaac-GR00T
cd Isaac-GR00T
I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 bash scripts/deployment/dgpu/install_deps.sh
```

The install script:
- Installs system dependencies (`ffmpeg`, `libaio-dev`)
- Installs `uv` if not present
- Runs `uv sync` to create a `.venv` with all Python dependencies (PyTorch 2.7.1 + CUDA 12.8)
- Builds `torchcodec` from source on aarch64 (required for video decoding)

Activate the virtual environment:

```bash
source .venv/bin/activate
```

Verify GPU access:

```bash
CUDA_VISIBLE_DEVICES=1 python -c "import torch; print(torch.cuda.get_device_name(0))"
```

Expected output: `NVIDIA GB300`

> [!NOTE]
> Replace `CUDA_VISIBLE_DEVICES=1` with the index of your GB300 GPU throughout this playbook. Run `nvidia-smi --query-gpu=index,name --format=csv,noheader` to find it.
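
If you script this playbook, the GPU index lookup can be automated. The sketch below parses the output of the `nvidia-smi` query shown in the note; the helper names and the sample text are hypothetical, and it assumes the usual `index, name` layout of `--format=csv,noheader`.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"]

def find_gpu_index(csv_text, name_substring="GB300"):
    """Return the index of the first GPU whose name contains name_substring, else None."""
    for line in csv_text.strip().splitlines():
        index, _, name = line.partition(",")
        if name_substring in name:
            return int(index.strip())
    return None

def detect_gpu_index(name_substring="GB300"):
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return find_gpu_index(out, name_substring)

# Hypothetical two-GPU output, for illustration only
sample = "0, Some Other GPU\n1, NVIDIA GB300\n"
print(find_gpu_index(sample))
```

On the DGX Station itself, call `detect_gpu_index()` and pass the result as the value of `CUDA_VISIBLE_DEVICES`.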

## Step 2. Set up HuggingFace authentication

```bash
export HF_TOKEN="your_huggingface_token"
```

Get a token from https://huggingface.co/settings/tokens if you don't have one.

## Step 3. Download the dataset and model

Download the LIBERO Spatial dataset and the GR00T N1.6 base model:

```bash
# Download LIBERO Spatial dataset (~2-3 GB)
huggingface-cli download \
    --repo-type dataset IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot \
    --local-dir examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/

# Copy the LIBERO modality config into the dataset's meta/ directory
cp examples/LIBERO/modality.json \
    examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/

# Download GR00T N1.6 base model (~6 GB)
huggingface-cli download nvidia/GR00T-N1.6-3B
```

Verify the dataset is ready:

```bash
ls examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/modality.json
```
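
Beyond checking that the file exists, you may want to confirm that it parses as JSON. The sketch below is a hypothetical helper, not a script from the repository, and it makes no assumption about which keys the config contains beyond it being a JSON object.

```python
import json
from pathlib import Path

def check_modality_config(dataset_root):
    """Return the sorted top-level keys of meta/modality.json.

    Raises FileNotFoundError if the file is missing and ValueError if it
    is not a JSON object.
    """
    path = Path(dataset_root) / "meta" / "modality.json"
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found; re-run the cp command above")
    with path.open() as f:
        config = json.load(f)  # raises json.JSONDecodeError on a corrupt file
    if not isinstance(config, dict):
        raise ValueError(f"{path} should contain a JSON object")
    return sorted(config)

# Usage (from the Isaac-GR00T directory):
# print(check_modality_config("examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot"))
```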

## Step 4. Verify the base model loads and runs

Confirm the GR00T N1.6 base model loads correctly and can produce actions. The repository includes a GR1 demo dataset (`demo_data/gr1.PickNPlace`) for quick verification:

```bash
CUDA_VISIBLE_DEVICES=1 python scripts/deployment/standalone_inference_script.py \
    --model-path nvidia/GR00T-N1.6-3B \
    --dataset-path demo_data/gr1.PickNPlace \
    --embodiment-tag GR1 \
    --traj-ids 0 \
    --inference-mode pytorch \
    --action-horizon 8 \
    --steps 32
```

You should see per-step timing output and no errors. This confirms the model, CUDA, and data pipeline all work before committing to a longer fine-tuning run.

> [!NOTE]
> If you see Triton/PTXAS errors about `sm_103a`, prepend `TORCHDYNAMO_DISABLE=1` to the command. See Troubleshooting for details.

> [!NOTE]
> The base model's pretrained processor does not include the `LIBERO_PANDA` embodiment configuration, so you cannot run evaluation directly on the LIBERO dataset with the base model. The LIBERO modality config is registered during fine-tuning. This is expected — LIBERO is a post-training benchmark.

## Step 5. Fine-tune GR00T N1.6 on LIBERO Spatial

Fine-tune the base model on the LIBERO Spatial dataset. DGX Station's GB300 GPU with 284 GB HBM3e lets you use a batch size of **128** on a single device, roughly 2–4x the 32–64 typically used on smaller GPUs. Larger batches give lower-variance gradient estimates and generally improve training throughput.

```bash
CUDA_VISIBLE_DEVICES=1 python \
    gr00t/experiment/launch_finetune.py \
    --base-model-path nvidia/GR00T-N1.6-3B \
    --dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
    --embodiment-tag LIBERO_PANDA \
    --num-gpus 1 \
    --output-dir output/libero_spatial_ft \
    --save-steps 500 \
    --save-total-limit 5 \
    --max-steps 2000 \
    --global-batch-size 128 \
    --learning-rate 1e-4 \
    --warmup-ratio 0.05 \
    --weight-decay 1e-5 \
    --state-dropout-prob 0.8 \
    --color-jitter-params brightness 0.3 contrast 0.4 saturation 0.5 hue 0.08 \
    --dataloader-num-workers 4
```

Training runs for **2000 steps** at batch size 128 and takes approximately 20–25 minutes on the GB300.

> [!NOTE]
> This playbook uses 2000 steps to keep execution time under an hour. For production-quality results matching the published 97.65% success rate on LIBERO Spatial, increase to **20,000 steps** by changing `--max-steps 20000`. Published results used a global batch size of 640 across 8 GPUs (80 per GPU), so batch size 128 on a single GB300 exceeds the published per-GPU batch size. The global batch is still 5x smaller than the published one, so results at the same step count may differ.
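
The batch-size comparison in the note is simple arithmetic, sketched below. The function names are illustrative; the numbers come from this playbook and the published configuration quoted above.

```python
def per_gpu_batch(global_batch, num_gpus):
    """Samples each GPU processes per optimizer step."""
    return global_batch // num_gpus

def samples_seen(global_batch, steps):
    """Total training samples consumed over a run."""
    return global_batch * steps

print(per_gpu_batch(640, 8))     # published setup: 80 per GPU
print(per_gpu_batch(128, 1))     # this playbook: 128 on one GPU
print(samples_seen(128, 2000))   # short run: 256000 samples
print(samples_seen(128, 20000))  # extended run: 2560000 samples
```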

**What the training flags mean:**

| Flag | Value | Purpose |
|------|-------|---------|
| `--global-batch-size` | 128 | Total samples per training step. GB300's 284 GB HBM3e makes this possible on a single GPU. |
| `--state-dropout-prob` | 0.8 | Drops state input 80% of the time during training, forcing the model to rely on vision. Improves generalization. |
| `--color-jitter-params` | brightness/contrast/saturation/hue | Randomly perturbs image colors during training for robustness to lighting variation. |
| `--warmup-ratio` | 0.05 | Linearly ramps learning rate from 0 to 1e-4 over the first 5% of steps (100 steps). |
| `--save-steps` | 500 | Saves a checkpoint every 500 steps. |

Monitor the training loss in the terminal. The HuggingFace Trainer logs progress at each step — look for the `loss` field decreasing over time. Checkpoints are saved every 500 steps to `output/libero_spatial_ft/`.

## Step 6. Evaluate the fine-tuned model

Run open-loop evaluation on the fine-tuned checkpoint. This compares the model's predicted actions against the ground truth from the dataset:

```bash
CUDA_VISIBLE_DEVICES=1 python gr00t/eval/open_loop_eval.py \
    --dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
    --embodiment-tag LIBERO_PANDA \
    --model-path output/libero_spatial_ft/checkpoint-2000/ \
    --traj-ids 0 1 2 \
    --action-horizon 16
```

The evaluation outputs:

- **Per-trajectory MSE and MAE** printed to the terminal
- **Average MSE** across all evaluated trajectories
- **JPEG visualizations** saved to `/tmp/open_loop_eval/` showing ground truth vs. predicted actions for each action dimension (x, y, z, roll, pitch, yaw, gripper)

Key things to look for in the plots:

- **Predicted trajectories** (orange line) should closely track the **ground truth** (blue line)
- **Gripper timing** — opening and closing at the correct moments
- **Lower MSE** indicates better action prediction accuracy
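
For reference, MSE and MAE over a trajectory can be computed as in the sketch below. It uses plain Python lists and toy values; the repository's evaluation script may aggregate over timesteps and action dimensions differently, so treat this as a definition of the metrics rather than a reproduction of its output.

```python
def mse(pred, truth):
    """Mean squared error over all timesteps and action dimensions."""
    errors = [(p - t) ** 2 for pr, tr in zip(pred, truth) for p, t in zip(pr, tr)]
    return sum(errors) / len(errors)

def mae(pred, truth):
    """Mean absolute error over all timesteps and action dimensions."""
    errors = [abs(p - t) for pr, tr in zip(pred, truth) for p, t in zip(pr, tr)]
    return sum(errors) / len(errors)

# Toy trajectory: 2 timesteps x 2 action dimensions (made-up values)
truth = [[0.0, 1.0], [1.0, 1.0]]
pred = [[0.0, 0.5], [1.0, 2.0]]
print(mse(pred, truth))  # 0.3125
print(mae(pred, truth))  # 0.375
```

MSE penalizes large deviations more heavily than MAE, so a single badly timed gripper action can dominate the MSE while barely moving the MAE.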

> [!TIP]
> Even at 2000 steps, the fine-tuned model's predicted actions should track the ground truth clearly better than random predictions would. The published closed-loop success rate on LIBERO Spatial, obtained with 20,000 training steps, is 97.65%. Open-loop MSE does not measure success rate directly.

## Step 7. Run inference timing benchmark

Measure the fine-tuned model's per-step inference latency:

```bash
CUDA_VISIBLE_DEVICES=1 python scripts/deployment/standalone_inference_script.py \
    --model-path output/libero_spatial_ft/checkpoint-2000/ \
    --dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
    --embodiment-tag LIBERO_PANDA \
    --traj-ids 0 \
    --inference-mode pytorch \
    --action-horizon 8
```

> [!NOTE]
> If you see Triton/PTXAS errors about `sm_103a`, prepend `TORCHDYNAMO_DISABLE=1` to the command. This runs inference in eager mode. See Troubleshooting for details.

The timing output breaks down into:

- **Data processing** — loading and preprocessing the observation
- **Backbone** — vision-language model forward pass
- **Action head** — diffusion transformer denoising (4 steps)
- **End-to-end** — total inference time per action chunk

In eager mode (without `torch.compile`), expect ~240 ms per step. With `torch.compile` working, expect ~38 ms per step, comparable to H100.
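
To put those latencies in perspective, the sketch below converts them to inference calls per second and includes a generic timing helper. The helper is a hypothetical illustration, not the repository's benchmark code; when timing GPU work you would also need to call `torch.cuda.synchronize()` before reading the clock.

```python
import time

def mean_latency_ms(fn, warmup=2, iters=10):
    """Average wall-clock time of fn() in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) * 1000 / iters

def calls_per_second(latency_ms):
    """Inference calls per second at a given per-call latency."""
    return 1000.0 / latency_ms

print(round(calls_per_second(240), 1))  # eager mode: 4.2 calls/s
print(round(calls_per_second(38), 1))   # torch.compile: 26.3 calls/s
print(round(240 / 38, 1))               # speedup from compile: 6.3x
```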

## Step 8. Clean up

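Your fine-tuned checkpoints in `output/libero_spatial_ft/` are deleted along with the repository, so copy them elsewhere first if you want to keep them.

To remove the environment:

```bash
deactivate
cd ..
rm -rf Isaac-GR00T
```

The base model downloaded in Step 3 remains in the HuggingFace cache (`~/.cache/huggingface` by default). Delete it separately if you need the disk space.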

## Next steps

- **Increase training steps** — Change `--max-steps` to 20000 for results closer to the published 97.65% success rate on LIBERO Spatial. Training time scales linearly (~3.5 hours at 20K steps).
- **Try other LIBERO suites** — Download `libero_10_no_noops`, `libero_goal_no_noops`, or `libero_object_no_noops` datasets from the `IPEC-COMMUNITY` HuggingFace organization and repeat the workflow. Published success rates: Object 98.45%, Goal 97.5%, 10-Long 94.35%.
- **Closed-loop simulation evaluation** — Set up the LIBERO simulation environment to test the fine-tuned model in a live control loop with the Panda robot arm. See the [LIBERO evaluation guide](https://github.com/NVIDIA/Isaac-GR00T/blob/main/examples/LIBERO/README.md#evaluate-checkpoint) for server-client setup.
- **Custom embodiments** — Fine-tune GR00T on your own robot data by following the [custom embodiment guide](https://github.com/NVIDIA/Isaac-GR00T/blob/main/getting_started/finetune_new_embodiment.md). Requires converting your data to LeRobot v2 format and defining a modality config.
- **Experiment with batch size** — The GB300's 284 GB HBM3e may support even larger batch sizes depending on which model components are being tuned. The default configuration tunes the projector and diffusion model only. Enabling `--tune-llm` or `--tune-visual` increases memory usage significantly.
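
The linear scaling mentioned under "Increase training steps" can be worked out as follows. This is a rough sketch that extrapolates from the 20–25 minutes quoted for 2000 steps and ignores fixed costs such as model loading and checkpoint saving.

```python
def estimate_minutes(target_steps, ref_steps=2000, ref_minutes=(20, 25)):
    """Scale a reference (low, high) training time linearly with step count."""
    scale = target_steps / ref_steps
    return tuple(m * scale for m in ref_minutes)

low, high = estimate_minutes(20000)
print(f"{low / 60:.1f}-{high / 60:.1f} hours")  # 3.3-4.2 hours
```

The ~3.5 hour figure quoted above falls inside this range.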

## Troubleshooting

## Common Issues

### Issue: `install_deps.sh` fails building torchcodec

**Solution:**

Ensure the license confirmation env var is set:

```bash
I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 bash scripts/deployment/dgpu/install_deps.sh
```

If the build still fails, ensure FFmpeg dev libraries are installed:

```bash
sudo apt-get install -y libavdevice-dev libavfilter-dev libavformat-dev \
    libavcodec-dev libavutil-dev libswresample-dev libswscale-dev
```

### Issue: `huggingface-cli download` fails with 401 Unauthorized

**Solution:**

Verify your HuggingFace token is set and valid:

```bash
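# Report whether the token is set without printing the secret itself
[ -n "$HF_TOKEN" ] && echo "HF_TOKEN is set" || echo "HF_TOKEN is not set"
huggingface-cli whoami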
```

If the token is not set:

```bash
export HF_TOKEN="your_token_here"
```

Make sure you have accepted any required model agreements on the HuggingFace model page.

### Issue: CUDA out of memory during fine-tuning

**Solution:**

If fine-tuning fails with an OOM error at batch size 128, reduce the batch size:

```bash
--global-batch-size 64
```

Also check that no other processes are using GPU memory:

```bash
nvidia-smi
```

If you are tuning additional model components (`--tune-llm` or `--tune-visual`), these significantly increase memory usage. The default configuration (projector + diffusion model only) is the most memory-efficient.

### Issue: Triton/PTXAS errors about `sm_103a` during inference

**Solution:**

The bundled Triton version may not yet support SM103 (GB300). This causes errors like:

```
ptxas-blackwell fatal: Value 'sm_103a' is not defined for option 'gpu-name'
```

Disable `torch.compile` by prepending:

```bash
TORCHDYNAMO_DISABLE=1 python scripts/deployment/standalone_inference_script.py ...
```

This runs inference in eager mode (~240 ms/step instead of ~38 ms/step with compile). Training and open-loop evaluation are not affected since they use eager mode by default.
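
If you launch inference from a script, the fallback can be automated. The sketch below is a hypothetical wrapper, not part of the repository: it reruns a command with `TORCHDYNAMO_DISABLE=1` when the captured output contains the `ptxas` error shown above. It assumes the error text appears on stdout or stderr, and capturing output means you will not see live progress.

```python
import os
import subprocess

def needs_eager_fallback(output_text):
    """True when the output contains the ptxas sm_103a error."""
    return "ptxas" in output_text and "sm_103a" in output_text

def run_with_fallback(cmd):
    """Run cmd; on the sm_103a error, retry once with torch.compile disabled."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0 and needs_eager_fallback(result.stdout + result.stderr):
        # Copy the current environment and add the variable for the retry only
        env = dict(os.environ, TORCHDYNAMO_DISABLE="1")
        result = subprocess.run(cmd, capture_output=True, text=True, env=env)
    return result

message = "ptxas-blackwell fatal: Value 'sm_103a' is not defined for option 'gpu-name'"
print(needs_eager_fallback(message))  # True
```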

### Issue: `ModuleNotFoundError: No module named 'gr00t'`

**Solution:**

The virtual environment is most likely not activated. Run:

```bash
source .venv/bin/activate
```

Verify you are in the Isaac-GR00T directory:

```bash
pwd
# Should show: .../Isaac-GR00T
```
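
You can also confirm from Python which interpreter is running and whether the package is importable. This is a minimal diagnostic sketch using only the standard library; the function name is illustrative.

```python
import importlib.util
import sys

def environment_report(package="gr00t"):
    """Summarize the active interpreter and whether `package` can be imported."""
    return {
        # In a virtual environment, sys.prefix differs from sys.base_prefix
        "in_venv": sys.prefix != sys.base_prefix,
        "interpreter": sys.executable,
        "package_found": importlib.util.find_spec(package) is not None,
    }

for key, value in environment_report().items():
    print(f"{key}: {value}")
```

If `in_venv` is `False` or `interpreter` does not point inside `Isaac-GR00T/.venv`, activate the environment and try again.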

### Issue: Training loss is not decreasing

**Solution:**

At 2000 steps, the model may not have converged fully — this is expected for the shortened playbook run. If loss remains flat after 500+ steps:

1. Verify the dataset was downloaded correctly and the modality config was copied:
   ```bash
   ls examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/modality.json
   ```

2. Check that the correct embodiment tag is used (`LIBERO_PANDA`, not `NEW_EMBODIMENT`).

3. Try increasing the learning rate to `5e-4` for faster initial convergence on short runs.

### Issue: `nvidia-smi` shows the wrong GPU

**Solution:**

On DGX Station, the GB300 may not be device 0. Find the correct index:

```bash
nvidia-smi --query-gpu=index,name --format=csv,noheader
```

Use the GB300's index with `CUDA_VISIBLE_DEVICES`:

```bash
CUDA_VISIBLE_DEVICES=1 python ...
```

### Issue: Slow data loading during training

**Solution:**

Increase the number of dataloader workers:

```bash
--dataloader-num-workers 8
```

### Issue: Video decoding errors (`NotImplementedError` or torchcodec not found)

**Solution:**

The `install_deps.sh` script builds torchcodec from source on aarch64. If it wasn't built correctly, reinstall:

```bash
sudo apt-get install -y libavdevice-dev libavfilter-dev libavformat-dev \
    libavcodec-dev libavutil-dev libswresample-dev libswscale-dev \
    pkg-config cmake build-essential pybind11-dev

# Run from the Isaac-GR00T directory so the venv path resolves after the cd below
VENV_PYTHON="$(pwd)/.venv/bin/python"

git clone --depth 1 --branch release/0.4 https://github.com/meta-pytorch/torchcodec.git /tmp/torchcodec
cd /tmp/torchcodec
I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 ENABLE_CUDA=1 \
    uv pip install --python "$VENV_PYTHON" . --no-build-isolation
cd - && rm -rf /tmp/torchcodec
```