Nanochat Training

Train a small ChatGPT-style LLM (nanochat) with tokenizer, pretraining, midtraining, and SFT on DGX Station with GB300 Ultra


Overview

Basic idea

This playbook demonstrates how to train nanochat on a DGX Station with the GB300 Ultra Superchip. You run the full pipeline on a single system: custom BPE tokenizer training, base model pretraining, midtraining (conversation format), supervised fine-tuning (SFT), and inference via CLI or web UI.

The project uses the PyTorch NGC container, FineWeb for pretraining, SmolTalk for SFT, and Weights & Biases for logging. The default speedrun configuration trains a 561M-parameter (d20) model suitable for learning and experimentation.

What you'll accomplish

You will have a working nanochat setup that trains a small LLM and serves it for chat.

  • Environment: Docker image with PyTorch and nanochat dependencies on your DGX Station.
  • Training pipeline: Tokenizer (65K BPE), pretraining (~11.2B tokens), midtraining, SFT, and automated report generation.
  • Inference: ChatGPT-style web UI and CLI to chat with the base, mid, or SFT checkpoints.
  • Monitoring: W&B dashboards and nanochat/report.md with metrics and samples.
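
As a rough sanity check on the ~11.2B-token pretraining figure: it is consistent with a budget of about 20 training tokens per parameter for the 561M-parameter d20 model. The 20:1 ratio is an assumption here (a common compute-optimal rule of thumb), since this playbook does not state how the figure was derived, and the function name is purely illustrative.

```shell
# Rough token-budget arithmetic (illustrative helper, not part of nanochat).
# Arguments: parameter count in millions, assumed tokens-per-parameter ratio.
token_budget_millions() {
  local params_m="$1" ratio="$2"
  echo $(( params_m * ratio ))
}

# 561M parameters x 20 tokens/parameter (assumed ratio) = 11220M, about 11.2B tokens
token_budget_millions 561 20
```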

What to know before starting

  • Basic Linux command line and shell usage.
  • Familiarity with Docker and GPU containers (e.g. docker run --gpus all).
  • Optional: understanding of LLM training (tokenizer, pretraining, fine-tuning).

Prerequisites

Hardware:

  • NVIDIA DGX Station with GB300 Ultra Superchip.
  • Sufficient GPU memory for the chosen model (the GB300 Ultra provides ample memory for the d20 speedrun).
  • Adequate storage for cache (~24GB+ for FineWeb data and checkpoints).

Software:

  • Docker with NVIDIA Container Toolkit: docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.01-py3 nvidia-smi
  • Network access to download datasets (Hugging Face, FineWeb) and container images.
  • Weights & Biases account and API key.
  • Hugging Face token for evaluation datasets.
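
Before starting, it may help to confirm the storage requirement listed under Hardware. The sketch below checks free space on the filesystem that will hold the caches; the helper name is hypothetical, the 24 GB threshold comes from the estimate above, and using the current directory as the cache location is an assumption.

```shell
# Illustrative helper: succeed if the filesystem holding PATH has at least MIN_GB free.
has_free_gb() {
  local path="$1" min_gb="$2" avail_kb
  # df -Pk prints POSIX-format output in 1024-byte blocks; column 4 is "Available".
  avail_kb=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -ge $(( min_gb * 1024 * 1024 )) ]
}

# 24 GB is the approximate minimum from the prerequisites;
# "$PWD" stands in for wherever your caches will live (an assumption).
if has_free_gb "$PWD" 24; then
  echo "Disk space OK"
else
  echo "Less than 24 GB free under $PWD" >&2
fi
```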

Ancillary files

All required assets are in the playbook directory nvidia/station-nanochat/assets (see the dgx-spark-playbooks repository).

  • assets/Dockerfile PyTorch NGC image plus nanochat dependencies and venv.
  • assets/setup.sh Clones nanochat, checks out the supported commit, and builds the Docker image.
  • assets/launch.sh Runs the training container on your DGX Station (runs the full pipeline: tokenizer, pretrain, midtrain, SFT, and report generation).
  • assets/README.md Additional detail on training stages, inference, and troubleshooting.

Time & risk

  • Estimated time: About 30 minutes for clone, setup, and launching the run. Full d20 speedrun training time depends on your DGX Station configuration (hours to a day or more).
  • Risk level: Medium
    • Large downloads (FineWeb) can fail or be slow; ensure stable network and disk space.
    • API keys (W&B, HF) must be set or the launch script will exit.
  • Rollback: Stop containers with docker stop, remove caches under ~/.cache/nanochat (or paths in launch.sh), and run docker system prune -a if needed.
  • Last Updated: 03/02/2026
    • First Publication

Instructions

Step 1. Prerequisites and environment

This playbook is for DGX Station (single node). Ensure your DGX Station has Docker with NVIDIA runtime, GPU access, and required API keys. Nanochat uses Weights & Biases (W&B) for training visualization and a Hugging Face token for evaluation datasets.

## Verify GPU and Docker
nvidia-smi
docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.01-py3 nvidia-smi

The output should list your GPU(s) and driver version. Create a W&B account and a Hugging Face token if you do not already have them, then export both keys:

export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
export HF_TOKEN=<YOUR_HF_TOKEN>
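
Because the launch script exits when either key is missing, a quick preflight check in the same shell can save a failed start. This is a minimal sketch; `check_required_env` is a hypothetical helper and is not part of the playbook scripts.

```shell
# Illustrative preflight check (not part of the playbook scripts).
# Prints each missing variable name and returns non-zero if any is unset or empty.
check_required_env() {
  local missing=0 name value
  for name in "$@"; do
    # Look up the variable whose name is stored in $name (portable indirect expansion).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing required environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

if check_required_env WANDB_API_KEY HF_TOKEN; then
  echo "API keys are set"
fi
```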

Step 2. Clone the playbook and set up nanochat

Clone the playbook repository and run the setup script to clone the nanochat repo and build the Docker image.

git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-spark-playbooks/nvidia/station-nanochat/assets

From the assets directory, run the setup script. It clones nanochat, checks out the supported commit, and builds the nanochat Docker image (PyTorch NGC base with tiktoken, tokenizers, datasets, wandb, etc.).

./setup.sh

Setup may take several minutes while the image builds. Verify the image:

docker images | grep nanochat

You should see the nanochat image listed.
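
For scripted checks, an exit code is easier to act on than grep output. The sketch below assumes the image is named `nanochat` (inferred from the grep check above, not confirmed against setup.sh); `image_exists` is a hypothetical helper.

```shell
# Illustrative helper: succeed if a local Docker image with this name exists.
# "docker image inspect" exits non-zero when the image is not found.
image_exists() {
  docker image inspect "$1" >/dev/null 2>&1
}

# The image name "nanochat" is an assumption based on the grep check.
if image_exists nanochat; then
  echo "nanochat image found"
else
  echo "nanochat image not found; re-run ./setup.sh" >&2
fi
```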

Step 3. Launch full training

Note

The default launch.sh uses cache directories under /nanochat_cache. If that path does not exist on your DGX Station, edit launch.sh and replace those paths with your own (e.g. $(pwd)/nanochat_cache and $(pwd)/hf_cache), and create the directories before running.
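
A minimal sketch of creating the replacement cache directories follows. `prepare_cache_dirs` is a hypothetical helper; `NANOCHAT_CACHE` and `HF_CACHE` are the variable names mentioned in the customization notes, but check your copy of launch.sh to confirm it reads them rather than hard-coded paths.

```shell
# Illustrative helper: create both cache directories under a base directory
# and export their paths for later commands in the same shell.
prepare_cache_dirs() {
  local base="$1"
  export NANOCHAT_CACHE="$base/nanochat_cache"
  export HF_CACHE="$base/hf_cache"
  mkdir -p "$NANOCHAT_CACHE" "$HF_CACHE"
}

# Use the current directory as the base, matching the $(pwd)/... example above.
prepare_cache_dirs "$(pwd)"
echo "nanochat cache: $NANOCHAT_CACHE"
echo "HF cache:       $HF_CACHE"
```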

To run full training (d20 model, 240 shards, midtraining, SFT, report) for higher-quality results, use the full launcher. On a DGX Station with GB300 Ultra this can take on the order of 16 hours:

export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
export HF_TOKEN=<YOUR_HF_TOKEN>
./launch_full.sh

This runs speedrun_full.sh inside the container: full FineWeb download (240 shards), 561M-parameter (d20) pretraining, midtraining, supervised fine-tuning, and report generation.
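
Since a run of this length can outlive an SSH session, one option is to start the launcher in the background with its output captured in a log file. This is a sketch under assumptions: `run_detached` is a hypothetical helper, and it only works if the launch script does not require an interactive terminal (for example, if it starts the container with `docker run -it`, a terminal multiplexer such as tmux or screen is the safer choice).

```shell
# Illustrative helper: run a command in the background, immune to hangups,
# with stdout and stderr appended to a log file. Prints the background PID.
run_detached() {
  local log="$1"
  shift
  nohup "$@" >>"$log" 2>&1 &
  echo "$!"
}

# Example (not executed here):
#   run_detached train.log ./launch_full.sh
#   tail -f train.log
```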

Step 4. Verify and use the model

After training completes, checkpoints and the tokenizer are under ~/.cache/nanochat/ (or the cache path used in launch.sh). Run inference from the nanochat directory (e.g. assets/nanochat) on your DGX Station.

Web UI (recommended):

cd nanochat
source ../.venv/bin/activate   # if using venv from container context; otherwise use the container
python -m scripts.chat_web

Open a browser to http://<STATION_IP>:8000, where <STATION_IP> is your DGX Station's IP address.

CLI:

cd nanochat
python -m scripts.chat_cli -p "Why is the sky blue?"
python -m scripts.chat_cli -i sft -p "Write a haiku about machine learning"

A full report is generated at nanochat/report.md after the run. You can also monitor training at wandb.ai under your project.

Step 5. Cleanup

To stop training early, interrupt the launch script or stop the container:

Warning

This stops the training run and any in-progress work in the container.

## If launch.sh is running: press Ctrl+C

## Or stop every running container that was started from the nanochat image
docker stop $(docker ps -q --filter ancestor=nanochat)

To free disk space after training (use the same path as your cache if you set NANOCHAT_CACHE):

rm -rf ./nanochat_cache ./hf_cache
docker system prune -a
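
Because `rm -rf` on a mistyped or empty path is unforgiving, a guarded wrapper can be safer when the cache path comes from a variable. The sketch below uses a hypothetical helper, `remove_cache_dir`, that refuses empty paths, missing directories, `/`, and `$HOME`, and reports the size before deleting.

```shell
# Illustrative helper: delete a cache directory only if the path is non-empty,
# is an existing directory, and does not resolve to / or $HOME.
remove_cache_dir() {
  local dir="$1"
  if [ -z "$dir" ] || [ ! -d "$dir" ]; then
    echo "Skipping '$dir': not an existing directory" >&2
    return 1
  fi
  case "$(cd "$dir" && pwd -P)" in
    /|"$HOME") echo "Refusing to delete '$dir'" >&2; return 1 ;;
  esac
  du -sh "$dir"      # show how much space will be freed
  rm -rf -- "$dir"
}

# Example (not executed here):
#   remove_cache_dir ./nanochat_cache
#   remove_cache_dir ./hf_cache
```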

Step 6. Next steps and customization

  • Small-scale run: To run a lighter training job with ./launch.sh, follow the customization guide and edit speedrun_station.sh. A smaller configuration can shorten training time.
  • Custom cache paths: Set NANOCHAT_CACHE and HF_CACHE before launching (e.g. export NANOCHAT_CACHE=/path/to/nanochat_cache) if you want the caches outside the assets directory.
  • Monitoring: Use nvidia-smi and W&B dashboards to watch GPU utilization and training metrics (loss, throughput).
  • Inference: Try the web UI and CLI with different checkpoints (base, mid, sft) and prompts; see sample prompts in assets/README.md.
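
For the monitoring item above, GPU memory pressure can be summarized from nvidia-smi's CSV query output. This is a sketch: `gpu_mem_percent` is a hypothetical helper, and it skips any line where the total is not a positive number (some systems report `[N/A]` for memory fields).

```shell
# Illustrative helper: read "used, total" pairs (MiB) on stdin, one line per GPU,
# and print memory utilization as a whole-number percentage per line.
gpu_mem_percent() {
  awk -F', *' '($2 + 0) > 0 { printf "%d%%\n", ($1 * 100) / $2 }'
}

# Intended use (requires an NVIDIA driver, so not executed here):
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits | gpu_mem_percent
```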

Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| WANDB_API_KEY is not set or HF_TOKEN is not set | Required env vars not exported before launch.sh | Run export WANDB_API_KEY=<your_key> and export HF_TOKEN=<your_token> in the same shell, then run ./launch.sh. |
| RuntimeError: CUDA out of memory | Batch size or model too large for GPU | In the training script in the cloned nanochat repo (e.g. speedrun.sh), reduce --device_batch_size (e.g. 16, 8, 4, 2, or 1). |
| Docker container not starting or no GPU | Docker or NVIDIA runtime misconfigured | Run nvidia-smi on your DGX Station. Check that no other containers hold GPUs: docker ps. Test GPU access in Docker: docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.01-py3 nvidia-smi. |
| Permission denied or No such file or directory for cache paths in launch.sh | Paths like /home/scratch.lramesh_dpt/... don't exist on your system | Edit launch.sh: set cache dirs to paths you can create (e.g. $(pwd)/nanochat_cache, $(pwd)/hf_cache). Run mkdir -p <your_cache_dirs> and re-run launch.sh. |
| nanochat image not found when running launch.sh | Setup not run or build failed | From nvidia/station-nanochat/assets, run ./setup.sh and confirm with docker images (look for the nanochat image). |
| Training exits immediately or script doesn't wait | Container fails early (missing keys, paths, or OOM) | Check container logs: docker ps -a, then docker logs <container_id>. Fix env vars, cache paths, or batch size as above. |
| Wrong cache path or "No such file" when launching | launch.sh uses non-existent paths (e.g. /home/scratch...) | On DGX Station, edit launch.sh: replace cache dirs with $(pwd)/nanochat_cache and $(pwd)/hf_cache, then run mkdir -p nanochat_cache hf_cache. |
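
The container-log check from the troubleshooting entries can be wrapped in a small function. This is a sketch with a hypothetical helper, `last_container_logs`; it assumes the container was started from an image named `nanochat` and was not launched with `--rm` (in which case an exited container and its logs are already gone).

```shell
# Illustrative helper: print the last lines of the most recently created
# container (running or exited) that was started from the given image.
last_container_logs() {
  local image="$1" lines="${2:-50}" cid
  # docker ps lists newest containers first; -a includes exited ones, -q prints IDs only.
  cid=$(docker ps -a -q --filter "ancestor=$image" | head -n 1)
  if [ -z "$cid" ]; then
    echo "No container found for image '$image'" >&2
    return 1
  fi
  docker logs --tail "$lines" "$cid"
}

# Example (not executed here):
#   last_container_logs nanochat 100
```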