
Isaac GR00T N1.6 Fine-Tuning

Fine-tune and benchmark NVIDIA's GR00T N1.6 robotics foundation model on DGX Station

Overview

Basic idea

NVIDIA Isaac GR00T N1.6 is a 3-billion-parameter open vision-language-action (VLA) foundation model for generalist humanoid robot skills. It combines a Cosmos-Reason-2B vision-language backbone with a 32-layer Diffusion Transformer (DiT) action head that denoises continuous robot actions from multimodal input — language instructions and camera images. The model is pre-trained on over 10,000 hours of robot demonstration data spanning bimanual arms, semi-humanoid platforms, and full humanoids, then adapted to specific embodiments and tasks through fine-tuning.

In this playbook you will fine-tune GR00T N1.6 on the LIBERO Spatial benchmark, a manipulation task suite that tests spatial reasoning with a Panda robot arm. DGX Station's GB300 GPU with 284 GB of HBM3e memory enables a per-device batch size of 128, well above the 32 to 64 typically used on smaller GPUs, which accelerates convergence and improves training throughput.

What you'll accomplish

  • Set up the Isaac GR00T environment using uv (fast, reproducible Python packaging)
  • Verify the pre-trained base model loads and runs inference
  • Fine-tune GR00T N1.6 on the LIBERO Spatial dataset with batch size 128
  • Evaluate the fine-tuned model using open-loop evaluation and measure inference latency

What to know before starting

  • Familiarity with Python virtual environments
  • Familiarity with PyTorch training workflows (epochs, batch size, loss curves)
  • General understanding of robot manipulation concepts (actions, observations, trajectories)

Prerequisites

  • NVIDIA DGX Station with GB300 GPU (Blackwell SM103, 284 GB HBM3e)
  • CUDA toolkit installed: nvcc --version should show CUDA 12.8+
  • Git installed: git --version
  • HuggingFace account with access token (for model and dataset downloads)
  • Network access to HuggingFace, GitHub, and PyPI
  • At least 30 GB of free disk space (venv + model + dataset)
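The disk-space prerequisite can be checked before starting. The following is a minimal sketch using only the Python standard library; the helper names `free_gb` and `has_room` are illustrative and not part of the playbook. Note that the base model is downloaded into the Hugging Face cache, which by default lives under your home directory rather than the cloned repository, so it may be worth checking both locations.

```python
import shutil


def free_gb(path: str = ".") -> float:
    """Return free disk space at `path` in GB (1 GB = 10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_room(path: str = ".", required_gb: float = 30.0) -> bool:
    """True if the filesystem holding `path` has at least `required_gb` free."""
    return free_gb(path) >= required_gb


if __name__ == "__main__":
    # 30 GB is the figure from the prerequisites list (venv + model + dataset).
    print(f"Free: {free_gb('.'):.1f} GB, enough for the playbook: {has_room('.')}")
```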

Time & risk

  • Duration: ~45 minutes (5 min setup, 5 min dataset download, 25 min fine-tuning at 2000 steps, 5 min evaluation)
  • Risks: Model download requires HuggingFace authentication; uv sync installs packages into a project-local .venv
  • Rollback: Delete the cloned Isaac-GR00T directory to restore state. No system-level changes are made.
  • Last Updated: 04/06/2026
    • First Publication

Instructions

Step 1. Clone Isaac GR00T and install dependencies

Clone the repository and run the dGPU install script. This uses uv for fast, reproducible dependency management and automatically detects the aarch64 architecture:

git clone --recurse-submodules https://github.com/NVIDIA/Isaac-GR00T
cd Isaac-GR00T
I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 bash scripts/deployment/dgpu/install_deps.sh

The install script:

  • Installs system dependencies (ffmpeg, libaio-dev)
  • Installs uv if not present
  • Runs uv sync to create a .venv with all Python dependencies (PyTorch 2.7.1 + CUDA 12.8)
  • Builds torchcodec from source on aarch64 (required for video decoding)

Activate the virtual environment:

source .venv/bin/activate

Verify GPU access:

CUDA_VISIBLE_DEVICES=1 python -c "import torch; print(torch.cuda.get_device_name(0))"

Expected output: NVIDIA GB300

Note

Replace CUDA_VISIBLE_DEVICES=1 with the index of your GB300 GPU throughout this playbook. Run nvidia-smi --query-gpu=index,name --format=csv,noheader to find it.
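If you prefer to pick the index programmatically, the CSV output of that nvidia-smi query is easy to parse. This is a sketch only: `find_gpu_index` is a hypothetical helper, and the sample text below is made up for illustration, so the device names on your system may differ.

```python
def find_gpu_index(csv_text: str, name_fragment: str = "GB300"):
    """Return the index of the first GPU whose name contains `name_fragment`.

    Expects lines shaped like "<index>, <name>", as produced by
    nvidia-smi --query-gpu=index,name --format=csv,noheader.
    Returns None when no GPU matches.
    """
    for line in csv_text.strip().splitlines():
        index, _, name = line.partition(",")
        if name_fragment.lower() in name.lower():
            return int(index.strip())
    return None


# Hypothetical sample output, not captured from a real DGX Station.
sample = "0, NVIDIA Example GPU\n1, NVIDIA GB300\n"
print(find_gpu_index(sample))  # prints 1
```

On a real system, the text could come from `subprocess.run([...], capture_output=True, text=True).stdout` with the same nvidia-smi arguments.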

Step 2. Set up HuggingFace authentication

export HF_TOKEN="your_huggingface_token"

Get a token from https://huggingface.co/settings/tokens if you don't have one.
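To confirm the variable is visible to Python without printing the secret itself, something like the following may help. It is a minimal sketch; `describe_token` is an illustrative name, and it only checks that the variable is set, not that the token is valid.

```python
import os


def describe_token(env=os.environ) -> str:
    """Report whether HF_TOKEN is set, revealing only a short prefix."""
    token = env.get("HF_TOKEN", "")
    if not token:
        return "HF_TOKEN is not set"
    # Show three characters and the length so the secret never reaches the terminal.
    return f"HF_TOKEN is set ({token[:3]}..., {len(token)} characters)"


print(describe_token())
```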

Step 3. Download the dataset and model

Download the LIBERO Spatial dataset and the GR00T N1.6 base model:

# Download LIBERO Spatial dataset (~2-3 GB)
huggingface-cli download \
    --repo-type dataset IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot \
    --local-dir examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/

# Copy the LIBERO modality config into the dataset's meta/ directory
cp examples/LIBERO/modality.json \
    examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/

# Download GR00T N1.6 base model (~6 GB)
huggingface-cli download nvidia/GR00T-N1.6-3B

Verify the dataset is ready:

ls examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/modality.json
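The ls check only confirms that the file exists. A slightly stronger check is to confirm that it also parses as JSON. The sketch below assumes nothing about the schema of modality.json; `load_modality_config` is a hypothetical helper, and the demo runs against a throwaway directory with placeholder content rather than the real dataset.

```python
import json
import tempfile
from pathlib import Path


def load_modality_config(dataset_dir: str) -> dict:
    """Load <dataset_dir>/meta/modality.json and return it as a dict."""
    path = Path(dataset_dir) / "meta" / "modality.json"
    if not path.is_file():
        raise FileNotFoundError(f"Missing modality config: {path}")
    with path.open() as f:
        config = json.load(f)
    if not isinstance(config, dict):
        raise ValueError(f"Expected a JSON object in {path}")
    return config


# Demo with placeholder content; the real file's keys are not reproduced here.
with tempfile.TemporaryDirectory() as tmp:
    meta = Path(tmp) / "meta"
    meta.mkdir()
    (meta / "modality.json").write_text(json.dumps({"placeholder": {}}))
    demo_keys = sorted(load_modality_config(tmp))

print(demo_keys)  # prints ['placeholder']
```

For the real dataset, pass `examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot` as `dataset_dir`.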

Step 4. Verify the base model loads and runs

Confirm the GR00T N1.6 base model loads correctly and can produce actions. The base model ships with a GR1 demo dataset for quick verification:

CUDA_VISIBLE_DEVICES=1 python scripts/deployment/standalone_inference_script.py \
    --model-path nvidia/GR00T-N1.6-3B \
    --dataset-path demo_data/gr1.PickNPlace \
    --embodiment-tag GR1 \
    --traj-ids 0 \
    --inference-mode pytorch \
    --action-horizon 8 \
    --steps 32

You should see per-step timing output and no errors. This confirms the model, CUDA, and data pipeline all work before committing to a longer fine-tuning run.

Note

If you see Triton/PTXAS errors about sm_103a, prepend TORCHDYNAMO_DISABLE=1 to the command. See Troubleshooting for details.

Note

The base model's pretrained processor does not include the LIBERO_PANDA embodiment configuration, so you cannot run evaluation directly on the LIBERO dataset with the base model. The LIBERO modality config is registered during fine-tuning. This is expected — LIBERO is a post-training benchmark.

Step 5. Fine-tune GR00T N1.6 on LIBERO Spatial

Fine-tune the base model on the LIBERO Spatial dataset. DGX Station's GB300 GPU with 284 GB HBM3e lets you use a batch size of 128 on a single device, roughly 2 to 4x the 32 to 64 typically used on smaller GPUs. Larger batches average gradients over more samples per step, which tends to stabilize training and speed up convergence per wall-clock hour.

CUDA_VISIBLE_DEVICES=1 python \
    gr00t/experiment/launch_finetune.py \
    --base-model-path nvidia/GR00T-N1.6-3B \
    --dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
    --embodiment-tag LIBERO_PANDA \
    --num-gpus 1 \
    --output-dir output/libero_spatial_ft \
    --save-steps 500 \
    --save-total-limit 5 \
    --max-steps 2000 \
    --global-batch-size 128 \
    --learning-rate 1e-4 \
    --warmup-ratio 0.05 \
    --weight-decay 1e-5 \
    --state-dropout-prob 0.8 \
    --color-jitter-params brightness 0.3 contrast 0.4 saturation 0.5 hue 0.08 \
    --dataloader-num-workers 4

Training runs for 2000 steps at batch size 128 and takes approximately 20 to 25 minutes on the GB300.

Note

This playbook uses 2000 steps to keep execution time under an hour. For production-quality results matching the published 97.65% success rate on LIBERO Spatial, set --max-steps 20000. The published results used a global batch size of 640 across 8 GPUs (80 per GPU), so batch size 128 on a single GB300 exceeds the per-GPU batch size used in the published benchmarks.
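The comparison with the published setup comes down to simple arithmetic on steps and batch size. The sketch below counts training samples processed in each configuration. It assumes the published run also used 20,000 steps, which the note implies but does not state outright.

```python
def samples_seen(steps: int, global_batch_size: int) -> int:
    """Total training samples processed: one global batch per step."""
    return steps * global_batch_size


playbook = samples_seen(2_000, 128)     # this playbook's short run
extended = samples_seen(20_000, 128)    # same batch size, 20,000 steps
published = samples_seen(20_000, 640)   # assumption: published run used 20,000 steps
per_gpu_published = 640 // 8            # 8 GPUs sharing a global batch of 640

print(playbook)           # prints 256000
print(extended)           # prints 2560000
print(published)          # prints 12800000
print(per_gpu_published)  # prints 80
```

Under that assumption, even the 20,000-step single-GPU run processes one fifth of the samples of the published configuration, so matching the published success rate is not guaranteed.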

What the training flags mean:

| Flag | Value | Purpose |
| --- | --- | --- |
| --global-batch-size | 128 | Total samples per training step. GB300's 284 GB HBM3e makes this possible on a single GPU. |
| --state-dropout-prob | 0.8 | Drops state input 80% of the time during training, forcing the model to rely on vision. Improves generalization. |
| --color-jitter-params | brightness/contrast/saturation/hue | Randomly perturbs image colors during training for robustness to lighting variation. |
| --warmup-ratio | 0.05 | Linearly ramps learning rate from 0 to 1e-4 over the first 5% of steps (100 steps). |
| --save-steps | 500 | Saves a checkpoint every 500 steps. |
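The warmup described for --warmup-ratio can be sketched as a small function. This is an illustration of linear warmup only, not the trainer's actual scheduler: what happens after warmup is not stated in this playbook, so the sketch simply holds the peak rate, which is an assumption.

```python
def warmup_lr(step: int, max_steps: int = 2000, peak_lr: float = 1e-4,
              warmup_ratio: float = 0.05) -> float:
    """Learning rate at `step` under linear warmup from 0 to `peak_lr`."""
    warmup_steps = round(max_steps * warmup_ratio)  # 5% of 2000 steps = 100
    if warmup_steps == 0 or step >= warmup_steps:
        # Assumption: hold at the peak; a real schedule may decay from here.
        return peak_lr
    return peak_lr * step / warmup_steps


for step in (0, 50, 100):
    print(step, f"{warmup_lr(step):.1e}")
```

Halfway through warmup (step 50) the rate is half the peak, and from step 100 onward it reaches the full 1e-4.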

Monitor the training loss in the terminal. The HuggingFace Trainer logs progress at each step — look for the loss field decreasing over time. Checkpoints are saved every 500 steps to output/libero_spatial_ft/.
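If you redirect the training output to a file, the loss values can be pulled out and compared. The sketch below assumes the log contains dictionary-style lines with a `'loss'` key, which is how the HuggingFace Trainer commonly prints them; the exact format in your run may differ. The sample values are invented for illustration, and both helper names are hypothetical.

```python
import re

# Matches 'loss': <number> but not 'train_loss' or 'eval_loss'.
LOSS_PATTERN = re.compile(r"'loss':\s*([0-9.eE+-]+)")


def extract_losses(log_text: str) -> list:
    """Return every logged loss value, in order of appearance."""
    return [float(value) for value in LOSS_PATTERN.findall(log_text)]


def loss_trend(losses: list, window: int = 3) -> float:
    """Mean of the last `window` losses minus mean of the first `window`.

    A negative result means the loss went down over the run.
    """
    if len(losses) < 2 * window:
        raise ValueError("Not enough loss values for the requested window")
    head = sum(losses[:window]) / window
    tail = sum(losses[-window:]) / window
    return tail - head


# Invented sample log lines, not output from a real run.
sample_log = """
{'loss': 0.9, 'learning_rate': 1e-05, 'epoch': 0.01}
{'loss': 0.7, 'learning_rate': 2e-05, 'epoch': 0.02}
{'loss': 0.4, 'learning_rate': 3e-05, 'epoch': 0.03}
{'loss': 0.3, 'learning_rate': 4e-05, 'epoch': 0.04}
"""
losses = extract_losses(sample_log)
print(losses, loss_trend(losses, window=2) < 0)
```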

Step 6. Evaluate the fine-tuned model

Run open-loop evaluation on the fine-tuned checkpoint. This compares the model's predicted actions against the ground truth from the dataset:

CUDA_VISIBLE_DEVICES=1 python gr00t/eval/open_loop_eval.py \
    --dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
    --embodiment-tag LIBERO_PANDA \
    --model-path output/libero_spatial_ft/checkpoint-2000/ \
    --traj-ids 0 1 2 \
    --action-horizon 16

The evaluation outputs:

  • Per-trajectory MSE and MAE printed to the terminal
  • Average MSE across all evaluated trajectories
  • JPEG visualizations saved to /tmp/open_loop_eval/ showing ground truth vs. predicted actions for each action dimension (x, y, z, roll, pitch, yaw, gripper)

Key things to look for in the plots:

  • Predicted trajectories (orange line) should closely track the ground truth (blue line)
  • Gripper timing — opening and closing at the correct moments
  • Lower MSE indicates better action prediction accuracy
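For reference, MSE and MAE over a trajectory can be computed as follows. This is the textbook definition, averaged over every timestep and action dimension; how the evaluation script aggregates its numbers is not shown in this playbook, so treat the sketch as illustrative rather than a reimplementation. The toy data is made up.

```python
def mse_mae(predicted, ground_truth):
    """Return (MSE, MAE) over all timesteps and action dimensions.

    Both arguments are lists of timesteps, each a list of per-dimension
    action values (for example x, y, z, roll, pitch, yaw, gripper).
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("Trajectories must have the same number of timesteps")
    errors = []
    for pred_row, gt_row in zip(predicted, ground_truth):
        if len(pred_row) != len(gt_row):
            raise ValueError("Action dimensions must match at every timestep")
        errors.extend(p - g for p, g in zip(pred_row, gt_row))
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return mse, mae


# Toy example: two timesteps, seven action dimensions.
ground_truth = [[0.0] * 7, [0.0] * 7]
predicted = [[0.1] * 7, [-0.3] * 7]
mse, mae = mse_mae(predicted, ground_truth)
print(f"MSE={mse:.3f} MAE={mae:.3f}")  # prints MSE=0.050 MAE=0.200
```

Because MSE squares each error, a few large misses (such as a mistimed gripper action) raise it more than they raise MAE.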

Tip

Even at 2000 steps, the fine-tuned model should show clearly improved action prediction compared to random. With 20,000 steps, LIBERO Spatial achieves 97.65% success rate in closed-loop simulation.

Step 7. Run inference timing benchmark

Measure the fine-tuned model's per-step inference latency:

CUDA_VISIBLE_DEVICES=1 python scripts/deployment/standalone_inference_script.py \
    --model-path output/libero_spatial_ft/checkpoint-2000/ \
    --dataset-path examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/ \
    --embodiment-tag LIBERO_PANDA \
    --traj-ids 0 \
    --inference-mode pytorch \
    --action-horizon 8

Note

If you see Triton/PTXAS errors about sm_103a, prepend TORCHDYNAMO_DISABLE=1 to the command. This runs inference in eager mode. See Troubleshooting for details.

The timing output breaks down into:

  • Data processing — loading and preprocessing the observation
  • Backbone — vision-language model forward pass
  • Action head — diffusion transformer denoising (4 steps)
  • End-to-end — total inference time per action chunk

In eager mode (without torch.compile), expect ~240 ms per step. With torch.compile working, expect ~38 ms per step, comparable to H100.
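These latencies translate into control rates with a little arithmetic. The sketch below assumes that "per step" means one end-to-end inference producing one action chunk, and that all 8 actions in a chunk (the --action-horizon value used above) are executed before the next inference. Both are assumptions for illustration, since real controllers often re-plan before a chunk is exhausted.

```python
def chunk_rate_hz(latency_ms: float) -> float:
    """Inference calls per second at a given end-to-end latency."""
    return 1000.0 / latency_ms


def actions_per_second(latency_ms: float, action_horizon: int = 8) -> float:
    """Actions produced per second if every action in each chunk is used."""
    return action_horizon * chunk_rate_hz(latency_ms)


for label, latency in (("eager", 240.0), ("compiled", 38.0)):
    print(f"{label}: {chunk_rate_hz(latency):.1f} chunks/s, "
          f"{actions_per_second(latency):.1f} actions/s")
# prints:
# eager: 4.2 chunks/s, 33.3 actions/s
# compiled: 26.3 chunks/s, 210.5 actions/s
```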

Step 8. Clean up

To remove the environment:

deactivate
cd ..
rm -rf Isaac-GR00T

Your fine-tuned checkpoints in output/libero_spatial_ft/ are deleted with the repo. Copy them elsewhere first if you want to keep them.
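One way to copy the checkpoints out before deleting the repository is sketched below, using the standard library only. `backup_checkpoints` is a hypothetical helper, and the demo runs on temporary directories with empty stand-in checkpoint folders rather than real model files.

```python
import shutil
import tempfile
from pathlib import Path


def backup_checkpoints(src: str, dest: str) -> list:
    """Copy the checkpoint directory tree from `src` to `dest`.

    Returns the sorted names of the top-level entries now present in `dest`.
    """
    src_path, dest_path = Path(src), Path(dest)
    if not src_path.is_dir():
        raise FileNotFoundError(f"No checkpoint directory at {src_path}")
    # dirs_exist_ok requires Python 3.8 or newer.
    shutil.copytree(src_path, dest_path, dirs_exist_ok=True)
    return sorted(p.name for p in dest_path.iterdir())


# Demo with stand-in directories instead of real checkpoints.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "libero_spatial_ft"
    for name in ("checkpoint-500", "checkpoint-1000"):
        (src / name).mkdir(parents=True)
        (src / name / "marker.txt").write_text("stand-in file")
    copied = backup_checkpoints(str(src), str(Path(tmp) / "backup"))

print(copied)  # prints ['checkpoint-1000', 'checkpoint-500']
```

In practice `src` would be `output/libero_spatial_ft` and `dest` a directory of your choice outside the Isaac-GR00T checkout.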

Next steps

  • Increase training steps — Change --max-steps to 20000 for results closer to the published 97.65% success rate on LIBERO Spatial. Training time scales linearly (~3.5 hours at 20K steps).
  • Try other LIBERO suites — Download libero_10_no_noops, libero_goal_no_noops, or libero_object_no_noops datasets from the IPEC-COMMUNITY HuggingFace organization and repeat the workflow. Published success rates: Object 98.45%, Goal 97.5%, 10-Long 94.35%.
  • Closed-loop simulation evaluation — Set up the LIBERO simulation environment to test the fine-tuned model in a live control loop with the Panda robot arm. See the LIBERO evaluation guide for server-client setup.
  • Custom embodiments — Fine-tune GR00T on your own robot data by following the custom embodiment guide. Requires converting your data to LeRobot v2 format and defining a modality config.
  • Experiment with batch size — The GB300's 284 GB HBM3e may support even larger batch sizes depending on which model components are being tuned. The default configuration tunes the projector and diffusion model only. Enabling --tune-llm or --tune-visual increases memory usage significantly.
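The linear-scaling estimate for longer runs can be reproduced with a short calculation. The sketch takes the 25-minute figure for 2000 steps from the Time & risk section and ignores startup and checkpoint-saving overhead, so treat the result as a rough guide.

```python
def scaled_hours(target_steps: int, ref_steps: int, ref_minutes: float) -> float:
    """Estimate hours for `target_steps`, assuming time grows linearly with steps."""
    return ref_minutes * (target_steps / ref_steps) / 60.0


def implied_minutes(total_hours: float, total_steps: int, ref_steps: int) -> float:
    """Minutes per `ref_steps` implied by a total duration at `total_steps`."""
    return total_hours * 60.0 * ref_steps / total_steps


print(round(scaled_hours(20_000, 2_000, 25), 2))      # prints 4.17
print(round(implied_minutes(3.5, 20_000, 2_000), 1))  # prints 21.0
```

A 25-minute short run extrapolates to a little over 4 hours, while the ~3.5 hour estimate corresponds to about 21 minutes per 2000 steps, so the two figures agree to within normal run-to-run variation.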

Troubleshooting

Common Issues

Issue: install_deps.sh fails building torchcodec

Solution:

Ensure the license confirmation env var is set:

I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 bash scripts/deployment/dgpu/install_deps.sh

If the build still fails, ensure FFmpeg dev libraries are installed:

sudo apt-get install -y libavdevice-dev libavfilter-dev libavformat-dev \
    libavcodec-dev libavutil-dev libswresample-dev libswscale-dev

Issue: huggingface-cli download fails with 401 Unauthorized

Solution:

Verify your HuggingFace token is set and valid:

echo $HF_TOKEN
huggingface-cli whoami

If the token is not set:

export HF_TOKEN="your_token_here"

Make sure you have accepted any required model agreements on the HuggingFace model page.

Issue: CUDA out of memory during fine-tuning

Solution:

If fine-tuning fails with an OOM error at batch size 128, reduce the batch size:

--global-batch-size 64

Also check that no other processes are using GPU memory:

nvidia-smi

If you are tuning additional model components (--tune-llm or --tune-visual), these significantly increase memory usage. The default configuration (projector + diffusion model only) is the most memory-efficient.
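To see how much GPU memory is already in use before launching, you can query nvidia-smi with `--query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits` and parse the result. The sketch below is illustrative: the helper names are hypothetical, the sample numbers are invented, and a field that is not a plain integer would raise a ValueError.

```python
def parse_gpu_memory(csv_text: str) -> dict:
    """Map GPU index -> (used_mib, total_mib) from nvidia-smi CSV output.

    Expects lines shaped like "<index>, <used>, <total>" with values in MiB.
    """
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        usage[int(index)] = (int(used), int(total))
    return usage


def free_fraction(used_mib: int, total_mib: int) -> float:
    """Fraction of GPU memory that is currently unused."""
    return 1.0 - used_mib / total_mib


# Invented sample values, not measurements from a GB300.
sample = "0, 512, 49152\n1, 20480, 262144\n"
usage = parse_gpu_memory(sample)
print(usage[1], f"{free_fraction(*usage[1]):.1%} free")
```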

Issue: Triton/PTXAS errors about sm_103a during inference

Solution:

The bundled Triton version may not yet support SM103 (GB300). This causes errors like:

ptxas-blackwell fatal: Value 'sm_103a' is not defined for option 'gpu-name'

Disable torch.compile by prepending:

TORCHDYNAMO_DISABLE=1 python scripts/deployment/standalone_inference_script.py ...

This runs inference in eager mode (~240 ms/step instead of ~38 ms/step with compile). Training and open-loop evaluation are not affected since they use eager mode by default.

Issue: ModuleNotFoundError: No module named 'gr00t'

Solution:

The virtual environment is not activated. Run:

source .venv/bin/activate

Verify you are in the Isaac-GR00T directory:

pwd
# Should show: .../Isaac-GR00T

Issue: Training loss is not decreasing

Solution:

At 2000 steps, the model may not have converged fully — this is expected for the shortened playbook run. If loss remains flat after 500+ steps:

  1. Verify the dataset was downloaded correctly and the modality config was copied:

    ls examples/LIBERO/libero_spatial_no_noops_1.0.0_lerobot/meta/modality.json
    
  2. Check that the correct embodiment tag is used (LIBERO_PANDA, not NEW_EMBODIMENT).

  3. Try increasing the learning rate to 5e-4 for faster initial convergence on short runs.

Issue: nvidia-smi shows the wrong GPU

Solution:

On DGX Station, the GB300 may not be device 0. Find the correct index:

nvidia-smi --query-gpu=index,name --format=csv,noheader

Use the GB300's index with CUDA_VISIBLE_DEVICES:

CUDA_VISIBLE_DEVICES=1 python ...

Issue: Slow data loading during training

Solution:

Increase the number of dataloader workers:

--dataloader-num-workers 8

Issue: Video decoding errors (NotImplementedError or torchcodec not found)

Solution:

The install_deps.sh script builds torchcodec from source on aarch64. If it wasn't built correctly, reinstall:

sudo apt-get install -y libavdevice-dev libavfilter-dev libavformat-dev \
    libavcodec-dev libavutil-dev libswresample-dev libswscale-dev \
    pkg-config cmake build-essential pybind11-dev

# Run from the Isaac-GR00T directory: record the venv's Python path before changing directories
VENV_PYTHON="$(pwd)/.venv/bin/python"
git clone --depth 1 --branch release/0.4 https://github.com/meta-pytorch/torchcodec.git /tmp/torchcodec
cd /tmp/torchcodec
I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1 ENABLE_CUDA=1 \
    uv pip install --python "$VENV_PYTHON" . --no-build-isolation
cd - && rm -rf /tmp/torchcodec