This playbook demonstrates training of [nanochat](https://github.com/karpathy/nanochat) on DGX Station with the GB300 Ultra Superchip. You run the full pipeline on a single system: custom BPE tokenizer training, base model pretraining, supervised fine-tuning (SFT), and inference via CLI or web UI.
The project uses the PyTorch NGC container, FineWeb for pretraining data, SmolTalk for SFT, and Weights & Biases for logging. The default configuration trains a ~1B parameter (d24) Transformer model with FP8 precision.
Ensure your DGX Station has Docker with NVIDIA runtime and GPU access. Nanochat uses Weights & Biases (W&B) for training visualization and a Hugging Face token for evaluation datasets.
Create a [W&B account](https://wandb.ai/) and a [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens) if you don't have them. Export both keys in your shell:
Run the setup script. It clones [nanochat](https://github.com/karpathy/nanochat), checks out the supported commit, copies the station-adapted `speedrun_station.sh`, and builds the `nanochat` Docker image (PyTorch NGC base with dependencies):
Track training at [wandb.ai](https://wandb.ai/) under the `nanochat` project. The exact link to the wandb run would be provided in the training logs. Key metrics:
**Batch size:** The default `--device-batch-size=64` is tuned for the GB300's 288GB VRAM. Feel free to change the batch size if utilization is low or the training OOMs.
| `WANDB_API_KEY is not set` or `HF_TOKEN is not set` | Required env vars not exported before `launch.sh` | `export WANDB_API_KEY=<key>` and `export HF_TOKEN=<token>` in the same shell, then re-run `./launch.sh` |
| `RuntimeError: CUDA out of memory` | Batch size too large for available VRAM | Edit `speedrun_station.sh`: reduce `--device-batch-size` (try 64, 32, 16, 8). Re-run `./setup.sh` then `./launch.sh` |
| Docker container exits immediately | Missing env vars, bad cache paths, or build failure | Check logs: `docker ps -a` then `docker logs <container_id>`. Fix env vars or paths as needed |
| `nanochat` image not found | Setup not run or Docker build failed | From the `assets/` directory, run `./setup.sh` and confirm with `docker images \| grep nanochat` |
| `No such file or directory` for cache paths | Cache directories don't exist | `launch.sh` creates them automatically under `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`. If using custom paths, create them: `mkdir -p $NANOCHAT_CACHE $HF_CACHE` |
| Training hangs at "Waiting for dataset download" | Network issue downloading FineWeb shards | Check network connectivity. The download can take time depending on bandwidth. If it persists, restart `./launch.sh` |
| W&B shows wrong user / stale login | Cached W&B credentials in container volume | `speedrun_station.sh` runs `wandb login --relogin` with your key automatically. Ensure `WANDB_API_KEY` is correct |
| Container runs but `launch.sh` says "Training complete!" immediately | Container failed fast and exited before the poll loop detected it | Check `docker ps -a` for the exited container and inspect logs with `docker logs <id>` |
| GPU not visible inside container | Docker NVIDIA runtime not configured | Test: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi`. If it fails, install/configure [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) |