mirror of
https://github.com/NVIDIA/dgx-spark-playbooks.git
synced 2026-06-21 21:59:30 +00:00
473 lines
20 KiB
YAML
kind: Playbook
metadata:
name: station-gr00t
displayName: Isaac GR00T N1.6 Fine-Tuning
shortDescription: Fine-tune and benchmark NVIDIA's GR00T N1.6 robotics foundation model on DGX Station

publisher: nvidia
description: |
# REPLACE THIS WITH YOUR MODEL CARD
https://gitlab-master.nvidia.com/api-catalog/examples/-/blob/main/modelcard-example-mixtral8x7b.md?ref_type=heads

labelsV2:
- gpuType:playbook:gpu_type_station
- DGX Station
- GB300
- Robotics
- Isaac GR00T
- Fine-Tuning
- Blackwell
- VLA

attributes:
- key: DURATION
value: 45 MIN

spec:
artifactName: station-gr00t
nvcfFunctionId: None
attributes:

showUnavailableBanner: false
apiDocsUrl: None
termsOfUse: |

cta:
text: View on GitHub
url: https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/station-gr00t/


tabs:
-
id: overview

label: Overview
content: |
# Basic idea

NVIDIA Isaac GR00T N1.6 is a 3-billion-parameter open vision-language-action (VLA) foundation model for generalist humanoid robot skills. It combines a Cosmos-Reason-2B vision-language backbone with a 32-layer Diffusion Transformer (DiT) action head that denoises continuous robot actions from multimodal input (language instructions and camera images). The model is pre-trained on over 10,000 hours of robot demonstration data spanning bimanual arms, semi-humanoid platforms, and full humanoids, then adapted to specific embodiments and tasks through fine-tuning.

In this playbook you will fine-tune GR00T N1.6 on the LIBERO Spatial benchmark, a manipulation task suite that tests spatial reasoning with a Panda robot arm. DGX Station's GB300 GPU with 284 GB of HBM3e memory enables a per-device batch size of **128**, far exceeding the typical 32–64 used on smaller GPUs, which accelerates convergence and improves training throughput.

# What you'll accomplish

- Set up the Isaac GR00T environment using `uv` (fast, reproducible Python packaging)
- Verify the pre-trained base model loads and runs inference
- Fine-tune GR00T N1.6 on the LIBERO Spatial dataset with batch size 128
- Evaluate the fine-tuned model using open-loop evaluation and measure inference latency

# What to know before starting

- Familiarity with Python virtual environments
- Familiarity with PyTorch training workflows (epochs, batch size, loss curves)
- General understanding of robot manipulation concepts (actions, observations, trajectories)

# Prerequisites

- NVIDIA DGX Station with GB300 GPU (Blackwell SM103, 284 GB HBM3e)
- CUDA toolkit installed: `nvcc --version` should show CUDA 12.8+
- Git installed: `git --version`
- HuggingFace account with access token (for model and dataset downloads)
- Network access to HuggingFace, GitHub, and PyPI
- At least 30 GB of free disk space (venv + model + dataset)
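
The prerequisites above can be checked in one pass. The following is a minimal sketch and not part of the Isaac-GR00T repository: the helper name `check_tool` and the variable `REQUIRED_GB` are illustrative, and the 30 GB threshold is simply the figure from the list above.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check; not shipped with Isaac-GR00T.
REQUIRED_GB=30   # free-space figure from the prerequisites list

# Report whether a command is on PATH without aborting the script.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1 found"
  else
    echo "missing: $1"
  fi
}

check_tool git
check_tool nvcc

# Free space, in whole GB, on the filesystem holding the current directory.
free_kb=$(df -Pk . | awk 'NR==2 {print $4}')
free_gb=$(( free_kb / 1024 / 1024 ))
if [ "$free_gb" -ge "$REQUIRED_GB" ]; then
  echo "ok: ${free_gb} GB free"
else
  echo "low disk: ${free_gb} GB free, ${REQUIRED_GB} GB recommended"
fi
```

Run it from the directory where you plan to clone the repository, since the disk check looks at the current filesystem.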

# Time & risk

* **Duration:** ~45 minutes (5 min setup, 5 min dataset download, 20–25 min fine-tuning at 2000 steps, 5 min evaluation, plus a few minutes for the base-model check and timing benchmark)
* **Risks:** Model download requires HuggingFace authentication; `uv sync` installs packages into a project-local `.venv`
* **Rollback:** Delete the cloned `Isaac-GR00T` directory to remove the virtual environment, dataset, and checkpoints. The system packages installed by the install script (`ffmpeg`, `libaio-dev`) and `uv` itself remain installed.
* **Last Updated:** 04/06/2026
  * First Publication



-
id: instructions

label: Instructions
content: |
# Step 1. Clone Isaac GR00T and install dependencies

Clone the repository and run the dGPU install script. This uses `uv` for fast, reproducible dependency management and automatically detects the aarch64 architecture:

```bash
git clone --recurse-submodules https://github.com/NVIDIA/Isaac-GR00T
cd Isaac-GR00T
I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 bash scripts/deployment/dgpu/install_deps.sh
```

The install script:
- Installs system dependencies (`ffmpeg`, `libaio-dev`)
- Installs `uv` if not present
- Runs `uv sync` to create a `.venv` with all Python dependencies (PyTorch 2.7.1 + CUDA 12.8)
- Builds `torchcodec` from source on aarch64 (required for video decoding)

Activate the virtual environment:

```bash
source .venv/bin/activate
```

Verify GPU access:

```bash
CUDA_VISIBLE_DEVICES=1 python -c "import torch; print(torch.cuda.get_device_name(0))"
```

Expected output: `NVIDIA GB300`

> [!NOTE]
> Replace `CUDA_VISIBLE_DEVICES=1` with the index of your GB300 GPU throughout this playbook. Run `nvidia-smi --query-gpu=index,name --format=csv,noheader` to find it.
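
If you prefer not to read the index off by eye, the lookup in the note above can be scripted. This is a sketch under stated assumptions: `find_gpu_index` is a hypothetical helper, and the two-line sample is invented to show the `index, name` shape that the `nvidia-smi` query prints.

```bash
# Hypothetical helper: print the index of the first GPU whose name contains $1.
# Reads "index, name" lines on stdin, the shape produced by
#   nvidia-smi --query-gpu=index,name --format=csv,noheader
find_gpu_index() {
  awk -F', ' -v pat="$1" 'index($2, pat) { print $1; exit }'
}

# Invented sample input, used only to demonstrate the parsing.
sample='0, Some Other GPU
1, NVIDIA GB300'
gpu_index=$(printf '%s\n' "$sample" | find_gpu_index "GB300")
echo "GB300 index in sample: $gpu_index"

# Live lookup; skipped when the NVIDIA driver tools are not installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index,name --format=csv,noheader | find_gpu_index "GB300"
fi
```

The printed index is the value to use for `CUDA_VISIBLE_DEVICES` in the commands that follow.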

# Step 2. Set up HuggingFace authentication

```bash
export HF_TOKEN="your_huggingface_token"
```

Get a token from https://huggingface.co/settings/tokens if you don't have one.
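
To confirm the variable is set without printing the secret to your terminal history or logs, a small check like the one below may help. `token_status` is a hypothetical helper name, not something provided by the HuggingFace tooling.

```bash
# Hypothetical helper: report whether HF_TOKEN is set without echoing its value.
token_status() {
  if [ -n "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is set (${#HF_TOKEN} characters)"
  else
    echo "HF_TOKEN is not set"
  fi
}

token_status
```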

# Step 3. Download the dataset and model

Download the LIBERO Spatial dataset and the GR00T N1.6 base model:

```bash
# Download LIBERO Spatial dataset (~2-3 GB)
huggingface-cli download \
  --repo-type dataset IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot \
  --local-dir examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/

# Copy the LIBERO modality config into the dataset's meta/ directory
cp examples/LIBERO/modality.json \
  examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/

# Download GR00T N1.6 base model (~6 GB)
huggingface-cli download nvidia/GR00T-N1.6-3B
```

Verify the dataset is ready:

```bash
ls examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/modality.json
```
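
A slightly more talkative version of that check is sketched below. `check_meta` is a hypothetical helper; the list of file names is passed in by the caller, and the mention of `info.json` is an assumption about the LeRobot layout rather than something this playbook states. The demo runs on a throwaway directory so it does not depend on the download.

```bash
# Hypothetical helper: verify that the named files exist under <root>/meta/.
# Returns non-zero if any file is missing.
check_meta() {
  local root="$1"; shift
  local missing=0
  local f
  for f in "$@"; do
    if [ -f "$root/meta/$f" ]; then
      echo "found: meta/$f"
    else
      echo "missing: meta/$f"
      missing=1
    fi
  done
  return "$missing"
}

# Demo on a temporary directory that mimics the layout.
demo=$(mktemp -d)
mkdir -p "$demo/meta"
touch "$demo/meta/modality.json"
check_meta "$demo" modality.json info.json || echo "dataset not ready"
rm -rf "$demo"
```

Against the real download, the call would be `check_meta examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot modality.json`.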

# Step 4. Verify the base model loads and runs

Confirm the GR00T N1.6 base model loads correctly and can produce actions. The base model ships with a GR1 demo dataset for quick verification:

```bash
CUDA_VISIBLE_DEVICES=1 python scripts/deployment/standalone_inference_script.py \
  --model-path nvidia/GR00T-N1.6-3B \
  --dataset-path demo_data/gr1.PickNPlace \
  --embodiment-tag GR1 \
  --traj-ids 0 \
  --inference-mode pytorch \
  --action-horizon 8 \
  --steps 32
```

You should see per-step timing output and no errors. This confirms the model, CUDA, and data pipeline all work before committing to a longer fine-tuning run.

> [!NOTE]
> If you see Triton/PTXAS errors about `sm_103a`, prepend `TORCHDYNAMO_DISABLE=1` to the command. See Troubleshooting for details.

> [!NOTE]
> The base model's pretrained processor does not include the `LIBERO_PANDA` embodiment configuration, so you cannot run evaluation directly on the LIBERO dataset with the base model. The LIBERO modality config is registered during fine-tuning. This is expected: LIBERO is a post-training benchmark.

# Step 5. Fine-tune GR00T N1.6 on LIBERO Spatial

Fine-tune the base model on the LIBERO Spatial dataset. DGX Station's GB300 GPU with 284 GB HBM3e lets you use a batch size of **128**, roughly 4x what fits on a typical 80 GB GPU. Larger batches give lower-variance gradient estimates per step and keep the GPU better utilized, which generally shortens the wall-clock time needed to reach a given loss.

```bash
CUDA_VISIBLE_DEVICES=1 python \
  gr00t/experiment/launch_finetune.py \
  --base-model-path nvidia/GR00T-N1.6-3B \
  --dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
  --embodiment-tag LIBERO_PANDA \
  --num-gpus 1 \
  --output-dir output/libero_spatial_ft \
  --save-steps 500 \
  --save-total-limit 5 \
  --max-steps 2000 \
  --global-batch-size 128 \
  --learning-rate 1e-4 \
  --warmup-ratio 0.05 \
  --weight-decay 1e-5 \
  --state-dropout-prob 0.8 \
  --color-jitter-params brightness 0.3 contrast 0.4 saturation 0.5 hue 0.08 \
  --dataloader-num-workers 4
```

Training runs for **2000 steps** at batch size 128 and takes approximately 20–25 minutes on the GB300.

> [!NOTE]
> This playbook uses 2000 steps to keep execution time under an hour. For production-quality results matching the published 97.65% success rate on LIBERO Spatial, increase to **20,000 steps** by changing `--max-steps 20000`. Published results used a global batch size of 640 across 8 GPUs (80 per GPU), so batch size 128 on a single GB300 exceeds the per-GPU batch size used in the published benchmarks, although the global batch remains smaller.

**What the training flags mean:**

| Flag | Value | Purpose |
|------|-------|---------|
| `--global-batch-size` | 128 | Total samples per training step. GB300's 284 GB HBM3e makes this possible on a single GPU. |
| `--state-dropout-prob` | 0.8 | Drops state input 80% of the time during training, forcing the model to rely on vision. Improves generalization. |
| `--color-jitter-params` | brightness/contrast/saturation/hue | Randomly perturbs image colors during training for robustness to lighting variation. |
| `--warmup-ratio` | 0.05 | Linearly ramps learning rate from 0 to 1e-4 over the first 5% of steps (100 steps). |
| `--save-steps` | 500 | Saves a checkpoint every 500 steps. |

Monitor the training loss in the terminal. The HuggingFace Trainer logs progress at each step; look for the `loss` field decreasing over time. Checkpoints are saved every 500 steps to `output/libero_spatial_ft/`.
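
The numbers quoted in this step follow from simple arithmetic on the flag values. The sketch below only reproduces that arithmetic; the variable names are illustrative and nothing here calls the training script.

```bash
# Values taken from the fine-tuning command in this step.
max_steps=2000
global_batch=128
warmup_pct=5            # --warmup-ratio 0.05 expressed as a percentage

warmup_steps=$(( max_steps * warmup_pct / 100 ))
samples_seen=$(( max_steps * global_batch ))

# Published configuration mentioned in the note: batch 640 across 8 GPUs.
published_per_gpu=$(( 640 / 8 ))

echo "warmup steps:            $warmup_steps"
echo "samples processed:       $samples_seen"
echo "published per-GPU batch: $published_per_gpu"
```

If you change `--max-steps`, the warmup length changes with it because the ratio is applied to the total step count.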

# Step 6. Evaluate the fine-tuned model

Run open-loop evaluation on the fine-tuned checkpoint. This compares the model's predicted actions against the ground truth from the dataset:

```bash
CUDA_VISIBLE_DEVICES=1 python gr00t/eval/open_loop_eval.py \
  --dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
  --embodiment-tag LIBERO_PANDA \
  --model-path output/libero_spatial_ft/checkpoint-2000/ \
  --traj-ids 0 1 2 \
  --action-horizon 16
```

The evaluation outputs:

- **Per-trajectory MSE and MAE** printed to the terminal
- **Average MSE** across all evaluated trajectories
- **JPEG visualizations** saved to `/tmp/open_loop_eval/` showing ground truth vs. predicted actions for each action dimension (x, y, z, roll, pitch, yaw, gripper)

Key things to look for in the plots:

- **Predicted trajectories** (orange line) should closely track the **ground truth** (blue line)
- **Gripper timing**: opening and closing at the correct moments
- **Lower MSE** indicates better action prediction accuracy
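
For readers unfamiliar with the two metrics, the sketch below computes mean squared error and mean absolute error over pairs of predicted and ground-truth values. It illustrates the definitions only and is not the implementation used by `open_loop_eval.py`; `mse_mae` is a hypothetical helper and the input values are made up.

```bash
# Illustrative only: MSE and MAE over lines of "predicted ground_truth".
mse_mae() {
  awk '{
    d = $1 - $2              # per-sample error
    se += d * d              # accumulate squared error
    ae += (d < 0 ? -d : d)   # accumulate absolute error
    n++
  } END {
    if (n > 0) printf "MSE=%.4f MAE=%.4f\n", se / n, ae / n
  }'
}

# Made-up sample values, not real model output.
printf '%s\n' "1.0 0.9" "0.5 0.7" "0.3 0.0" "0.0 0.0" | mse_mae
```

MSE penalizes large deviations more heavily than MAE, which is why a single badly mistimed gripper action can move MSE noticeably while MAE changes little.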

> [!TIP]
> Even at 2000 steps, the fine-tuned model should show clearly improved action prediction compared to random predictions. With 20,000 steps, the published result on LIBERO Spatial is a 97.65% success rate in closed-loop simulation.

# Step 7. Run inference timing benchmark

Measure the fine-tuned model's per-step inference latency:

```bash
CUDA_VISIBLE_DEVICES=1 python scripts/deployment/standalone_inference_script.py \
  --model-path output/libero_spatial_ft/checkpoint-2000/ \
  --dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
  --embodiment-tag LIBERO_PANDA \
  --traj-ids 0 \
  --inference-mode pytorch \
  --action-horizon 8
```

> [!NOTE]
> If you see Triton/PTXAS errors about `sm_103a`, prepend `TORCHDYNAMO_DISABLE=1` to the command. This runs inference in eager mode. See Troubleshooting for details.

The timing output breaks down into:

- **Data processing**: loading and preprocessing the observation
- **Backbone**: vision-language model forward pass
- **Action head**: diffusion transformer denoising (4 steps)
- **End-to-end**: total inference time per action chunk

In eager mode (without `torch.compile`), expect ~240 ms per step. With `torch.compile` working, expect ~38 ms per step, comparable to H100.
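
To put the two latency figures in perspective, the sketch below converts them into inferences per second and a speedup ratio. It uses only the approximate numbers quoted above; `latency_summary` is a hypothetical helper, and your measured values will differ.

```bash
# Hypothetical helper: $1 = eager latency in ms, $2 = compiled latency in ms.
latency_summary() {
  awk -v e="$1" -v c="$2" 'BEGIN {
    printf "eager:    %.1f inferences/s\n", 1000 / e
    printf "compiled: %.1f inferences/s\n", 1000 / c
    printf "speedup:  %.1fx\n", e / c
  }'
}

# Approximate figures quoted in this step.
latency_summary 240 38
```

Substitute the end-to-end times from your own run to see how close your setup comes to these figures.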

# Step 8. Clean up

To remove the environment:

```bash
deactivate
cd ..
rm -rf Isaac-GR00T
```

Your fine-tuned checkpoints in `output/libero_spatial_ft/` are deleted with the repo. Copy them elsewhere first if you want to keep them.
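
One way to keep the checkpoints is sketched below. `backup_checkpoints` is a hypothetical helper and `$HOME/gr00t_checkpoints` is an arbitrary example destination; run it from inside the `Isaac-GR00T` directory before the clean-up commands above.

```bash
# Hypothetical helper: copy a checkpoint directory into a destination folder.
# $1: source directory, $2: destination directory (created if needed).
backup_checkpoints() {
  local src="$1" dest="$2"
  if [ ! -d "$src" ]; then
    echo "nothing to back up: $src not found"
    return 1
  fi
  mkdir -p "$dest"
  cp -a "$src" "$dest/"
  echo "copied $src to $dest/"
}

# Example destination is an assumption; pick any location outside the repo.
backup_checkpoints output/libero_spatial_ft "$HOME/gr00t_checkpoints" || true
```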

# Next steps

- **Increase training steps**: Change `--max-steps` to 20000 for results closer to the published 97.65% success rate on LIBERO Spatial. Training time scales roughly linearly with step count (~3.5 hours at 20K steps).
- **Try other LIBERO suites**: Download `libero_10_no_noops`, `libero_goal_no_noops`, or `libero_object_no_noops` datasets from the `IPEC-COMMUNITY` HuggingFace organization and repeat the workflow. Published success rates: Object 98.45%, Goal 97.5%, 10-Long 94.35%.
- **Closed-loop simulation evaluation**: Set up the LIBERO simulation environment to test the fine-tuned model in a live control loop with the Panda robot arm. See the [LIBERO evaluation guide](https://github.com/NVIDIA/Isaac-GR00T/blob/main/examples/LIBERO/README.md#evaluate-checkpoint) for server-client setup.
- **Custom embodiments**: Fine-tune GR00T on your own robot data by following the [custom embodiment guide](https://github.com/NVIDIA/Isaac-GR00T/blob/main/getting_started/finetune_new_embodiment.md). This requires converting your data to LeRobot v2 format and defining a modality config.
- **Experiment with batch size**: The GB300's 284 GB HBM3e may support even larger batch sizes depending on which model components are being tuned. The default configuration tunes the projector and diffusion model only. Enabling `--tune-llm` or `--tune-visual` increases memory usage significantly.
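
The time estimate for a longer run can be derived from the 2000-step timing in Step 5. The sketch below assumes strictly linear scaling with step count, which ignores fixed costs such as model loading and checkpoint writes; `scale_minutes` is a hypothetical helper.

```bash
# Hypothetical helper: scale a measured duration linearly to a new step count.
# $1: measured steps, $2: measured minutes, $3: target steps.
scale_minutes() {
  echo $(( $2 * $3 / $1 ))
}

# 2000 steps took roughly 20 to 25 minutes in this playbook.
low=$(scale_minutes 2000 20 20000)
high=$(scale_minutes 2000 25 20000)
echo "estimated time for 20000 steps: ${low} to ${high} minutes"
```

The resulting range of 200 to 250 minutes brackets the ~3.5 hour figure given above.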



-
id: troubleshooting

label: Troubleshooting
content: |
# Common Issues

## Issue: `install_deps.sh` fails building torchcodec

**Solution:**

Ensure the license confirmation env var is set:

```bash
I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 bash scripts/deployment/dgpu/install_deps.sh
```

If the build still fails, ensure FFmpeg dev libraries are installed:

```bash
sudo apt-get install -y libavdevice-dev libavfilter-dev libavformat-dev \
  libavcodec-dev libavutil-dev libswresample-dev libswscale-dev
```

## Issue: `huggingface-cli download` fails with 401 Unauthorized

**Solution:**

Verify your HuggingFace token is set and valid:

```bash
echo $HF_TOKEN
huggingface-cli whoami
```

If the token is not set:

```bash
export HF_TOKEN="your_token_here"
```

Make sure you have accepted any required model agreements on the HuggingFace model page.

## Issue: CUDA out of memory during fine-tuning

**Solution:**

If fine-tuning fails with an OOM error at batch size 128, reduce the batch size:

```bash
--global-batch-size 64
```

Also check that no other processes are using GPU memory:

```bash
nvidia-smi
```

If you are tuning additional model components (`--tune-llm` or `--tune-visual`), these significantly increase memory usage. The default configuration (projector + diffusion model only) is the most memory-efficient.

## Issue: Triton/PTXAS errors about `sm_103a` during inference

**Solution:**

The bundled Triton version may not yet support SM103 (GB300). This causes errors like:

```
ptxas-blackwell fatal: Value 'sm_103a' is not defined for option 'gpu-name'
```

Disable `torch.compile` by prepending:

```bash
TORCHDYNAMO_DISABLE=1 python scripts/deployment/standalone_inference_script.py ...
```

This runs inference in eager mode (~240 ms/step instead of ~38 ms/step with compile). Training and open-loop evaluation are not affected since they use eager mode by default.

## Issue: `ModuleNotFoundError: No module named 'gr00t'`

**Solution:**

The virtual environment is not activated. Run:

```bash
source .venv/bin/activate
```

Verify you are in the Isaac-GR00T directory:

```bash
pwd
# Should show: .../Isaac-GR00T
```
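
Both conditions can be checked together. The sketch below relies on the `VIRTUAL_ENV` variable that the activate script exports and on the `gr00t/` package directory referenced by the commands in this playbook; `check_env` is a hypothetical helper.

```bash
# Hypothetical helper: report venv activation and whether gr00t/ is present.
check_env() {
  # VIRTUAL_ENV is exported by the virtual environment's activate script.
  if [ -n "${VIRTUAL_ENV:-}" ]; then
    echo "venv active: $VIRTUAL_ENV"
  else
    echo "no venv active: run 'source .venv/bin/activate'"
  fi
  # The repository root contains the gr00t/ package directory.
  if [ -d gr00t ]; then
    echo "gr00t/ found in $(pwd)"
  else
    echo "gr00t/ not found in $(pwd)"
  fi
}

check_env
```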

## Issue: Training loss is not decreasing

**Solution:**

At 2000 steps, the model may not have converged fully; this is expected for the shortened playbook run. If loss remains flat after 500+ steps:

1. Verify the dataset was downloaded correctly and the modality config was copied:
   ```bash
   ls examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/modality.json
   ```

2. Check that the correct embodiment tag is used (`LIBERO_PANDA`, not `NEW_EMBODIMENT`).

3. Try increasing the learning rate to `5e-4` for faster initial convergence on short runs.

## Issue: `nvidia-smi` shows the wrong GPU

**Solution:**

On DGX Station, the GB300 may not be device 0. Find the correct index:

```bash
nvidia-smi --query-gpu=index,name --format=csv,noheader
```

Use the GB300's index with `CUDA_VISIBLE_DEVICES`:

```bash
CUDA_VISIBLE_DEVICES=1 python ...
```

## Issue: Slow data loading during training

**Solution:**

Increase the number of dataloader workers:

```bash
--dataloader-num-workers 8
```
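
When raising the worker count, it can help to stay at or below the number of CPU cores. That cap is a common rule of thumb rather than guidance from this playbook, and `pick_workers` below is a hypothetical helper.

```bash
# Hypothetical helper: print the smaller of the desired worker count ($1)
# and the number of available CPU cores ($2).
pick_workers() {
  if [ "$1" -le "$2" ]; then
    echo "$1"
  else
    echo "$2"
  fi
}

cores=$(nproc)
workers=$(pick_workers 8 "$cores")
echo "--dataloader-num-workers $workers"
```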

## Issue: Video decoding errors (`NotImplementedError` or torchcodec not found)

**Solution:**

The `install_deps.sh` script builds torchcodec from source on aarch64. If it wasn't built correctly, reinstall it from the Isaac-GR00T directory:

```bash
# Remember the repo path so the venv resolves after changing directory
REPO_DIR="$(pwd)"

sudo apt-get install -y libavdevice-dev libavfilter-dev libavformat-dev \
  libavcodec-dev libavutil-dev libswresample-dev libswscale-dev \
  pkg-config cmake build-essential pybind11-dev

git clone --depth 1 --branch release/0.4 https://github.com/meta-pytorch/torchcodec.git /tmp/torchcodec
cd /tmp/torchcodec
I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 ENABLE_CUDA=1 \
  uv pip install --python "$REPO_DIR/.venv/bin/python" . --no-build-isolation
cd "$REPO_DIR" && rm -rf /tmp/torchcodec
```


resources:
- name: Isaac GR00T (GitHub)
url: https://github.com/NVIDIA/Isaac-GR00T

- name: GR00T N1.6 Model (HuggingFace)
url: https://huggingface.co/nvidia/GR00T-N1.6-3B

- name: GR00T N1.6 Research Blog
url: https://research.nvidia.com/labs/gear/gr00t-n1_6/

- name: GR00T N1 Paper (arXiv)
url: https://arxiv.org/abs/2503.14734

- name: LIBERO Benchmark
url: https://libero-project.github.io/main.html