# Isaac GR00T N1.6 Fine-Tuning
> Fine-tune and benchmark NVIDIA's GR00T N1.6 robotics foundation model on DGX Station
## Table of Contents
- [Overview](#overview)
- [Instructions](#instructions)
- [Troubleshooting](#troubleshooting)
  - [Issue: `install_deps.sh` fails building torchcodec](#issue-installdepssh-fails-building-torchcodec)
  - [Issue: `huggingface-cli download` fails with 401 Unauthorized](#issue-huggingface-cli-download-fails-with-401-unauthorized)
  - [Issue: CUDA out of memory during fine-tuning](#issue-cuda-out-of-memory-during-fine-tuning)
  - [Issue: Triton/PTXAS errors about `sm_103a` during inference](#issue-tritonptxas-errors-about-sm103a-during-inference)
  - [Issue: `ModuleNotFoundError: No module named 'gr00t'`](#issue-modulenotfounderror-no-module-named-gr00t)
  - [Issue: Training loss is not decreasing](#issue-training-loss-is-not-decreasing)
  - [Issue: `nvidia-smi` shows the wrong GPU](#issue-nvidia-smi-shows-the-wrong-gpu)
  - [Issue: Slow data loading during training](#issue-slow-data-loading-during-training)
  - [Issue: Video decoding errors (`NotImplementedError` or torchcodec not found)](#issue-video-decoding-errors-notimplementederror-or-torchcodec-not-found)
---
## Overview
## Basic idea
NVIDIA Isaac GR00T N1.6 is a 3-billion-parameter open vision-language-action (VLA) foundation model for generalist humanoid robot skills. It combines a Cosmos-Reason-2B vision-language backbone with a 32-layer Diffusion Transformer (DiT) action head that denoises continuous robot actions from multimodal input — language instructions and camera images. The model is pre-trained on over 10,000 hours of robot demonstration data spanning bimanual arms, semi-humanoid platforms, and full humanoids, then adapted to specific embodiments and tasks through fine-tuning.
In this playbook you will fine-tune GR00T N1.6 on the LIBERO Spatial benchmark, a manipulation task suite that tests spatial reasoning with a Panda robot arm. DGX Station's GB300 GPU with 284 GB of HBM3e memory enables a per-device batch size of **128**, well above the 32 to 64 typical on smaller GPUs, which accelerates convergence and improves training throughput.
## What you'll accomplish
- Set up the Isaac GR00T environment using `uv` (fast, reproducible Python packaging)
- Verify the pre-trained base model loads and runs inference
- Fine-tune GR00T N1.6 on the LIBERO Spatial dataset with batch size 128
- Evaluate the fine-tuned model using open-loop evaluation and measure inference latency
## What to know before starting
- Familiarity with Python virtual environments
- Familiarity with PyTorch training workflows (epochs, batch size, loss curves)
- General understanding of robot manipulation concepts (actions, observations, trajectories)
## Prerequisites
- NVIDIA DGX Station with GB300 GPU (Blackwell SM103, 284 GB HBM3e)
- CUDA toolkit installed: `nvcc --version` should show CUDA 12.8+
- Git installed: `git --version`
- HuggingFace account with access token (for model and dataset downloads)
- Network access to HuggingFace, GitHub, and PyPI
- At least 30 GB of free disk space (venv + model + dataset)
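The prerequisites above can be checked in one pass before cloning. The following is a minimal sketch and not part of the Isaac GR00T repository: the helper names `free_gb` and `has_free_space` are hypothetical, and the 30 GB threshold is taken from the list above.

```bash
# Hypothetical preflight helpers; not part of Isaac-GR00T.

free_gb() {
  # Free space, in whole GiB, on the filesystem that holds directory $1.
  # df -Pk prints POSIX-format output in 1024-byte blocks; column 4 is "Available".
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1048576) }'
}

has_free_space() {
  # Succeeds when directory $1 has at least $2 GiB free.
  [ "$(free_gb "$1")" -ge "$2" ]
}

# Report which required tools are on PATH.
for tool in nvcc git; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: missing"
  fi
done

# Check the current directory against the 30 GB requirement.
if has_free_space . 30; then
  echo "disk: at least 30 GB free"
else
  echo "disk: less than 30 GB free"
fi
```

Run it from the directory where you plan to clone the repository, since the free-space check applies to the filesystem that holds that directory.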
## Time & risk
* **Duration:** ~45 minutes (5 min setup, 5 min dataset download, 25 min fine-tuning at 2000 steps, 5 min evaluation)
* **Risks:** Model download requires HuggingFace authentication; `uv sync` installs packages into a project-local `.venv`
* **Rollback:** Delete the cloned `Isaac-GR00T` directory to restore state. No system-level changes are made.
* **Last Updated:** 04/06/2026
  * First Publication
## Instructions
## Step 1. Clone Isaac GR00T and install dependencies
Clone the repository and run the dGPU install script. This uses `uv` for fast, reproducible dependency management and automatically detects the aarch64 architecture:
```bash
git clone --recurse-submodules https://github.com/NVIDIA/Isaac-GR00T
cd Isaac-GR00T
I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 bash scripts/deployment/dgpu/install_deps.sh
```
The install script:
- Installs system dependencies (`ffmpeg`, `libaio-dev`)
- Installs `uv` if not present
- Runs `uv sync` to create a `.venv` with all Python dependencies (PyTorch 2.7.1 + CUDA 12.8)
- Builds `torchcodec` from source on aarch64 (required for video decoding)
Activate the virtual environment:
```bash
source .venv/bin/activate
```
Verify GPU access:
```bash
CUDA_VISIBLE_DEVICES=1 python -c "import torch; print(torch.cuda.get_device_name(0))"
```
Expected output: `NVIDIA GB300`
> [!NOTE]
> Replace `CUDA_VISIBLE_DEVICES=1` with the index of your GB300 GPU throughout this playbook. Run `nvidia-smi --query-gpu=index,name --format=csv,noheader` to find it.
## Step 2. Set up HuggingFace authentication
```bash
export HF_TOKEN="your_huggingface_token"
```
Get a token from https://huggingface.co/settings/tokens if you don't have one.
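Before starting the downloads, it can help to confirm the token is actually present in the current shell without printing it. The helper below is a hypothetical sketch (the name `require_hf_token` is not part of any tool); it only checks that the variable is non-empty and does not validate the token against HuggingFace.

```bash
require_hf_token() {
  # Fails with a message if HF_TOKEN is unset or empty.
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set; export it before running huggingface-cli download" >&2
    return 1
  fi
  # Print only the length so the secret never reaches the terminal or logs.
  echo "HF_TOKEN is set (${#HF_TOKEN} characters)"
}

# Usage:
#   require_hf_token && huggingface-cli whoami
```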
## Step 3. Download the dataset and model
Download the LIBERO Spatial dataset and the GR00T N1.6 base model:
```bash
## Download LIBERO Spatial dataset (~2-3 GB)
huggingface-cli download \
--repo-type dataset IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot \
--local-dir examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/
## Copy the LIBERO modality config into the dataset's meta/ directory
cp examples/LIBERO/modality.json \
examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/
## Download GR00T N1.6 base model (~6 GB)
huggingface-cli download nvidia/GR00T-N1.6-3B
```
Verify the dataset is ready:
```bash
ls examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/modality.json
```
## Step 4. Verify the base model loads and runs
Confirm the GR00T N1.6 base model loads correctly and can produce actions. The base model ships with a GR1 demo dataset for quick verification:
```bash
CUDA_VISIBLE_DEVICES=1 python scripts/deployment/standalone_inference_script.py \
--model-path nvidia/GR00T-N1.6-3B \
--dataset-path demo_data/gr1.PickNPlace \
--embodiment-tag GR1 \
--traj-ids 0 \
--inference-mode pytorch \
--action-horizon 8 \
--steps 32
```
You should see per-step timing output and no errors. This confirms the model, CUDA, and data pipeline all work before committing to a longer fine-tuning run.
> [!NOTE]
> If you see Triton/PTXAS errors about `sm_103a`, prepend `TORCHDYNAMO_DISABLE=1` to the command. See Troubleshooting for details.
> [!NOTE]
> The base model's pretrained processor does not include the `LIBERO_PANDA` embodiment configuration, so you cannot run evaluation directly on the LIBERO dataset with the base model. The LIBERO modality config is registered during fine-tuning. This is expected — LIBERO is a post-training benchmark.
## Step 5. Fine-tune GR00T N1.6 on LIBERO Spatial
Fine-tune the base model on the LIBERO Spatial dataset. DGX Station's GB300 GPU with 284 GB HBM3e lets you use a batch size of **128** — roughly 4x what fits on a typical 80 GB GPU. Larger batch sizes mean more stable gradients and faster convergence per wall-clock hour.
```bash
CUDA_VISIBLE_DEVICES=1 python \
gr00t/experiment/launch_finetune.py \
--base-model-path nvidia/GR00T-N1.6-3B \
--dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
--embodiment-tag LIBERO_PANDA \
--num-gpus 1 \
--output-dir output/libero_spatial_ft \
--save-steps 500 \
--save-total-limit 5 \
--max-steps 2000 \
--global-batch-size 128 \
--learning-rate 1e-4 \
--warmup-ratio 0.05 \
--weight-decay 1e-5 \
--state-dropout-prob 0.8 \
--color-jitter-params brightness 0.3 contrast 0.4 saturation 0.5 hue 0.08 \
--dataloader-num-workers 4
```
Training runs for **2000 steps** at batch size 128 and takes approximately 20 to 25 minutes on the GB300.
> [!NOTE]
> This playbook uses 2000 steps to keep execution time under an hour. For production-quality results matching the published 97.65% success rate on LIBERO Spatial, increase to **20,000 steps** by changing `--max-steps 20000`. Published results used batch size 640 across 8 GPUs (80 per GPU) — batch size 128 on a single GB300 exceeds the per-GPU batch size used in the published benchmarks.
**What the training flags mean:**
| Flag | Value | Purpose |
|------|-------|---------|
| `--global-batch-size` | 128 | Total samples per training step. GB300's 284 GB HBM3e makes this possible on a single GPU. |
| `--state-dropout-prob` | 0.8 | Drops state input 80% of the time during training, forcing the model to rely on vision. Improves generalization. |
| `--color-jitter-params` | brightness/contrast/saturation/hue | Randomly perturbs image colors during training for robustness to lighting variation. |
| `--warmup-ratio` | 0.05 | Linearly ramps learning rate from 0 to 1e-4 over the first 5% of steps (100 steps). |
| `--save-steps` | 500 | Saves a checkpoint every 500 steps. |
Monitor the training loss in the terminal. The HuggingFace Trainer logs progress at each step — look for the `loss` field decreasing over time. Checkpoints are saved every 500 steps to `output/libero_spatial_ft/`.
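To make the `--warmup-ratio` row in the table above concrete, the sketch below computes the warmup length and the learning rate at a given step. It models only the linear ramp described in the table. Whatever decay the trainer applies after warmup is not modeled, and the function names are hypothetical.

```bash
warmup_steps() {
  # $1: max steps, $2: warmup ratio -> number of warmup steps (rounded)
  awk -v n="$1" -v r="$2" 'BEGIN { printf "%d\n", n * r + 0.5 }'
}

warmup_lr() {
  # $1: current step, $2: max steps, $3: warmup ratio, $4: peak learning rate
  # Linear ramp from 0 to the peak during warmup; the peak is returned afterwards
  # (assumption: post-warmup decay is ignored in this sketch).
  awk -v s="$1" -v n="$2" -v r="$3" -v lr="$4" 'BEGIN {
    w = n * r
    if (w > 0 && s < w) lr = lr * s / w
    printf "%.2e\n", lr
  }'
}

warmup_steps 2000 0.05          # 100
warmup_lr 50 2000 0.05 1e-4     # 5.00e-05, halfway through warmup
warmup_lr 100 2000 0.05 1e-4    # 1.00e-04, warmup complete
```

If you change `--max-steps`, the warmup length changes with it: at 20,000 steps the same ratio gives 1,000 warmup steps.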
## Step 6. Evaluate the fine-tuned model
Run open-loop evaluation on the fine-tuned checkpoint. This compares the model's predicted actions against the ground truth from the dataset:
```bash
CUDA_VISIBLE_DEVICES=1 python gr00t/eval/open_loop_eval.py \
--dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
--embodiment-tag LIBERO_PANDA \
--model-path output/libero_spatial_ft/checkpoint-2000/ \
--traj-ids 0 1 2 \
--action-horizon 16
```
The evaluation outputs:
- **Per-trajectory MSE and MAE** printed to the terminal
- **Average MSE** across all evaluated trajectories
- **JPEG visualizations** saved to `/tmp/open_loop_eval/` showing ground truth vs. predicted actions for each action dimension (x, y, z, roll, pitch, yaw, gripper)
Key things to look for in the plots:
- **Predicted trajectories** (orange line) should closely track the **ground truth** (blue line)
- **Gripper timing** — opening and closing at the correct moments
- **Lower MSE** indicates better action prediction accuracy
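For reference, MSE and MAE are the mean squared and mean absolute differences between predicted and ground-truth action values. The sketch below computes both for a toy list of pairs. It illustrates the metrics only; it is not the code in `open_loop_eval.py`, and the two-column input format and the name `mse_mae` are assumptions made for the example.

```bash
mse_mae() {
  # stdin: one "predicted ground_truth" pair per line
  awk '{
         d = $1 - $2
         se += d * d                 # accumulate squared error
         ae += (d < 0 ? -d : d)      # accumulate absolute error
         n++
       }
       END { if (n) printf "mse=%.4f mae=%.4f\n", se / n, ae / n }'
}

# Four toy samples of a single action dimension.
printf '0.5 0.0\n0.0 0.0\n1.0 2.0\n1.0 1.0\n' | mse_mae
# mse=0.3125 mae=0.3750
```

Because the error is squared, MSE penalizes a few large misses more heavily than MAE does, which is why the two numbers can rank checkpoints differently.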
> [!TIP]
> Even at 2000 steps, the fine-tuned model should predict actions clearly better than random predictions would. With 20,000 steps, the published LIBERO Spatial result is a 97.65% success rate in closed-loop simulation.
## Step 7. Run inference timing benchmark
Measure the fine-tuned model's per-step inference latency:
```bash
CUDA_VISIBLE_DEVICES=1 python scripts/deployment/standalone_inference_script.py \
--model-path output/libero_spatial_ft/checkpoint-2000/ \
--dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
--embodiment-tag LIBERO_PANDA \
--traj-ids 0 \
--inference-mode pytorch \
--action-horizon 8
```
> [!NOTE]
> If you see Triton/PTXAS errors about `sm_103a`, prepend `TORCHDYNAMO_DISABLE=1` to the command. This runs inference in eager mode. See Troubleshooting for details.
The timing output breaks down into:
- **Data processing** — loading and preprocessing the observation
- **Backbone** — vision-language model forward pass
- **Action head** — diffusion transformer denoising (4 steps)
- **End-to-end** — total inference time per action chunk
In eager mode (without `torch.compile`), expect ~240 ms per step. With `torch.compile` working, expect ~38 ms per step, comparable to an H100.
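One way to read these latencies is as an upper bound on how many actions per second the policy can supply. The sketch below assumes that each inference step returns one chunk of `--action-horizon` actions and that the quoted latency is the end-to-end time for that chunk; both are assumptions for illustration, and the helper name is hypothetical.

```bash
actions_per_second() {
  # $1: actions per chunk (--action-horizon), $2: end-to-end latency in ms
  awk -v h="$1" -v ms="$2" 'BEGIN { printf "%.1f\n", h * 1000 / ms }'
}

actions_per_second 8 240   # 33.3  (eager mode)
actions_per_second 8 38    # 210.5 (with torch.compile)
```

Substitute the end-to-end figure from your own timing output to see what control rate your run can sustain.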
## Step 8. Clean up
To remove the environment:
```bash
deactivate
cd ..
rm -rf Isaac-GR00T
```
Your fine-tuned checkpoints in `output/libero_spatial_ft/` are deleted with the repo. Copy them elsewhere first if you want to keep them.
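A small helper can copy the checkpoints out before the repository is deleted. This is a sketch: the function name and the destination path in the usage line are hypothetical, so pick a location with enough free space.

```bash
backup_checkpoints() {
  # $1: training output directory, $2: backup destination
  src="$1"
  dest="$2"
  if [ ! -d "$src" ]; then
    echo "no checkpoint directory at $src" >&2
    return 1
  fi
  # Copy the directory contents, including every checkpoint-* subdirectory.
  mkdir -p "$dest" && cp -r "$src"/. "$dest"/
}

# Usage, run from inside Isaac-GR00T before the clean-up commands:
#   backup_checkpoints output/libero_spatial_ft "$HOME/gr00t_checkpoints/libero_spatial_ft"
```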
## Next steps
- **Increase training steps** — Change `--max-steps` to 20000 for results closer to the published 97.65% success rate on LIBERO Spatial. Training time scales linearly (~3.5 hours at 20K steps).
- **Try other LIBERO suites** — Download `libero_10_no_noops`, `libero_goal_no_noops`, or `libero_object_no_noops` datasets from the `IPEC-COMMUNITY` HuggingFace organization and repeat the workflow. Published success rates: Object 98.45%, Goal 97.5%, 10-Long 94.35%.
- **Closed-loop simulation evaluation** — Set up the LIBERO simulation environment to test the fine-tuned model in a live control loop with the Panda robot arm. See the [LIBERO evaluation guide](https://github.com/NVIDIA/Isaac-GR00T/blob/main/examples/LIBERO/README.md#evaluate-checkpoint) for server-client setup.
- **Custom embodiments** — Fine-tune GR00T on your own robot data by following the [custom embodiment guide](https://github.com/NVIDIA/Isaac-GR00T/blob/main/getting_started/finetune_new_embodiment.md). Requires converting your data to LeRobot v2 format and defining a modality config.
- **Experiment with batch size** — The GB300's 284 GB HBM3e may support even larger batch sizes depending on which model components are being tuned. The default configuration tunes the projector and diffusion model only. Enabling `--tune-llm` or `--tune-visual` increases memory usage significantly.
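The linear-scaling estimate for longer runs can be reproduced with integer arithmetic. This sketch assumes training time grows in direct proportion to the step count, as stated above, and uses the 20 to 25 minute range for the 2000-step run as the reference; the function name is hypothetical.

```bash
estimate_minutes() {
  # $1: target steps, $2: reference steps, $3: minutes measured for the reference run
  echo $(( $1 * $3 / $2 ))
}

estimate_minutes 20000 2000 20   # 200 minutes
estimate_minutes 20000 2000 25   # 250 minutes
```

The two results, roughly 3.3 to 4.2 hours, bracket the ~3.5 hour figure quoted for 20K steps. Replace the third argument with the time you measured for your own 2000-step run.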
## Troubleshooting
## Common Issues
### Issue: `install_deps.sh` fails building torchcodec
**Solution:**
Ensure the license confirmation env var is set:
```bash
I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 bash scripts/deployment/dgpu/install_deps.sh
```
If the build still fails, ensure FFmpeg dev libraries are installed:
```bash
sudo apt-get install -y libavdevice-dev libavfilter-dev libavformat-dev \
libavcodec-dev libavutil-dev libswresample-dev libswscale-dev
```
### Issue: `huggingface-cli download` fails with 401 Unauthorized
**Solution:**
Verify your HuggingFace token is set and valid:
```bash
echo $HF_TOKEN
huggingface-cli whoami
```
If the token is not set:
```bash
export HF_TOKEN="your_token_here"
```
Make sure you have accepted any required model agreements on the HuggingFace model page.
### Issue: CUDA out of memory during fine-tuning
**Solution:**
If fine-tuning fails with an OOM error at batch size 128, reduce the batch size:
```bash
--global-batch-size 64
```
Also check that no other processes are using GPU memory:
```bash
nvidia-smi
```
If you are tuning additional model components (`--tune-llm` or `--tune-visual`), these significantly increase memory usage. The default configuration (projector + diffusion model only) is the most memory-efficient.
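To see how much memory is actually free on the GPU you are training on, the `nvidia-smi` query output can be reduced to a single number. The sketch below parses the CSV that `nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits` prints (values in MiB). The helper name is hypothetical, and GPU index `1` follows the convention used in this playbook.

```bash
gpu_free_mib() {
  # stdin: lines of "index, memory.used, memory.total" in MiB
  # $1: GPU index to report on
  awk -F', ' -v idx="$1" '$1 == idx { print $3 - $2 }'
}

# Only query the driver when nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits \
    | gpu_free_mib 1
fi
```

If the free figure is far below the total before training starts, another process is holding GPU memory and reducing the batch size alone may not be enough.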
### Issue: Triton/PTXAS errors about `sm_103a` during inference
**Solution:**
The bundled Triton version may not yet support SM103 (GB300). This causes errors like:
```
ptxas-blackwell fatal: Value 'sm_103a' is not defined for option 'gpu-name'
```
Disable `torch.compile` by prepending:
```bash
TORCHDYNAMO_DISABLE=1 python scripts/deployment/standalone_inference_script.py ...
```
This runs inference in eager mode (~240 ms/step instead of ~38 ms/step with compile). Training and open-loop evaluation are not affected since they use eager mode by default.
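If you would rather try the compiled path first and fall back automatically, a wrapper along these lines is one option. It is a sketch under two assumptions: the failing command exits with a non-zero status, and the string `sm_103a` appears on stderr as in the error shown above. The function name is hypothetical.

```bash
run_with_eager_fallback() {
  # Runs the given command. If it fails and stderr mentions sm_103a,
  # runs it once more with torch.compile disabled.
  # Note: stderr is replayed only after the command exits.
  log="$(mktemp)"
  if "$@" 2> "$log"; then
    cat "$log" >&2
    rm -f "$log"
    return 0
  fi
  cat "$log" >&2
  if grep -q "sm_103a" "$log"; then
    rm -f "$log"
    echo "Retrying with TORCHDYNAMO_DISABLE=1 (eager mode)" >&2
    TORCHDYNAMO_DISABLE=1 "$@"
    return
  fi
  rm -f "$log"
  return 1
}

# Usage:
#   export CUDA_VISIBLE_DEVICES=1
#   run_with_eager_fallback python scripts/deployment/standalone_inference_script.py ...
```

Failures unrelated to `sm_103a` are passed through unchanged, so the wrapper does not mask other errors.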
### Issue: `ModuleNotFoundError: No module named 'gr00t'`
**Solution:**
The virtual environment is not activated. Run:
```bash
source .venv/bin/activate
```
Verify you are in the Isaac-GR00T directory:
```bash
pwd
## Should show: .../Isaac-GR00T
```
### Issue: Training loss is not decreasing
**Solution:**
At 2000 steps, the model may not have converged fully — this is expected for the shortened playbook run. If loss remains flat after 500+ steps:
1. Verify the dataset was downloaded correctly and the modality config was copied:
```bash
ls examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/modality.json
```
2. Check that the correct embodiment tag is used (`LIBERO_PANDA`, not `NEW_EMBODIMENT`).
3. Try increasing the learning rate to `5e-4` for faster initial convergence on short runs.
### Issue: `nvidia-smi` shows the wrong GPU
**Solution:**
On DGX Station, the GB300 may not be device 0. Find the correct index:
```bash
nvidia-smi --query-gpu=index,name --format=csv,noheader
```
Use the GB300's index with `CUDA_VISIBLE_DEVICES`:
```bash
CUDA_VISIBLE_DEVICES=1 python ...
```
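The lookup can also be scripted so the index is not hard-coded. The sketch below filters the `index, name` CSV printed by the `nvidia-smi` query above. The helper name `find_gpu_index` and the variable `GPU_INDEX` are hypothetical, and matching on the substring `GB300` assumes the device name contains it, as in the expected output of Step 1.

```bash
find_gpu_index() {
  # stdin: lines of "index, name"
  # $1: substring to look for in the GPU name; prints the first matching index
  awk -F', ' -v pat="$1" 'index($2, pat) { print $1; exit }'
}

# Only query the driver when nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
  GPU_INDEX="$(nvidia-smi --query-gpu=index,name --format=csv,noheader | find_gpu_index GB300)"
  echo "GB300 index: ${GPU_INDEX:-not found}"
fi
```

You can then run `export CUDA_VISIBLE_DEVICES="$GPU_INDEX"` once per shell instead of prefixing every command.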
### Issue: Slow data loading during training
**Solution:**
Increase the number of dataloader workers:
```bash
--dataloader-num-workers 8
```
### Issue: Video decoding errors (`NotImplementedError` or torchcodec not found)
**Solution:**
The `install_deps.sh` script builds torchcodec from source on aarch64. If it wasn't built correctly, reinstall:
```bash
sudo apt-get install -y libavdevice-dev libavfilter-dev libavformat-dev \
libavcodec-dev libavutil-dev libswresample-dev libswscale-dev \
pkg-config cmake build-essential pybind11-dev
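## Run from the Isaac-GR00T directory: record the venv interpreter before changing directory
VENV_PYTHON="$(pwd)/.venv/bin/python"
git clone --depth 1 --branch release/0.4 https://github.com/meta-pytorch/torchcodec.git /tmp/torchcodec
cd /tmp/torchcodec
I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 ENABLE_CUDA=1 \
uv pip install --python "$VENV_PYTHON" . --no-build-isolation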
cd - && rm -rf /tmp/torchcodec
```