dgx-spark-playbooks/nvidia/station-nanochat/assets
2026-06-02 18:47:24 +00:00
..
Dockerfile chore: Regenerate all playbooks 2026-06-02 18:47:24 +00:00
launch.sh chore: Regenerate all playbooks 2026-05-29 15:56:45 +00:00
README.md chore: Regenerate all playbooks 2026-05-26 18:25:53 +00:00
setup.sh chore: Regenerate all playbooks 2026-05-29 15:56:45 +00:00
speedrun_station.sh chore: Regenerate all playbooks 2026-05-29 15:56:45 +00:00

Nanochat Training on DGX Station

This project demonstrates training of nanochat, "the best ChatGPT that $100 can buy," on DGX Station. The demo includes tokenization, pretraining, midtraining, supervised fine-tuning (SFT), and inference through both CLI and web UI.

Overview

The project includes:

  • Full LLM Pipeline: Tokenizer training, pretraining, midtraining, and SFT
  • Custom Tokenizer: BPE tokenizer with 65K vocabulary trained on FineWeb
  • Evaluation Suite: CORE, ARC, GSM8K, HumanEval, MMLU benchmarks
  • Interactive Inference: Chat with your model via CLI or web UI
  • Docker Support: Complete containerized environment with PyTorch NGC

Contents

  1. Environment Setup
  2. Preparation
  3. Training
  4. Customization
  5. Inference
  6. Evaluation Results
  7. Architecture Details

1. Environment Setup

1.1 Prerequisites

Before starting, ensure you have:

  • DGX Station with driver and CUDA toolkit setup
  • Docker installed on the system
  • Huggingface and WandB API access

1.4 Enviornment Setup

For training visualization and logging, set up your W&B API key. If you don't have a W&B account, you can create one at wandb.ai. Additionally, a Huggingface token will be required for downloading certain datasets for model evaluation. Likewise, you can create a HF token by following the instructions at (huggingface.co](https://huggingface.co/docs/hub/en/security-tokens).

export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
export HF_TOKEN=<YOUR_HF_TOKEN>

2. Preparation

2.1 Clone the repository

Clone the current repository and change directories to the station-nanochat repository.

cd station-nanochat

2.2 Nanochat Setup

After navigating to the assets folder and run the setup script to clone nanochat and build the Docker image on both nodes.

sh setup.sh

The setup script will:

  • Clone the nanochat repository
  • Copy the modified speedrun_station.sh script for training on station
  • Build a custom Docker image for nanochat

Verify your directory structure after setup:

station-nanochat/assets/
├── Dockerfile
├── launch.sh
├── setup.sh
├── speedrun_station.sh
└── nanochat/
    ├── README.md
    ├── speedrun.sh (replaced with speedrun_station.sh)
    ├── scripts/
    ├── nanochat/
    └── ...

2.3 Verify Docker Image

Ensure the Docker image was built successfully on both nodes:

# On your system
docker images | grep nanochat

You should see the nanochat image listed on your system.

3. Training

3.1 Launch Training

Start training on DGX Station:

# Ensure that the previous environment variables are exported
# Launch training on both nodes
./ launch.sh 

The training script will automatically:

  1. Download ~24GB of FineWeb pretraining data
  2. Train a BPE tokenizer with 65K vocabulary
  3. Pretrain a 561M parameter Transformer model (d20)
  4. Run midtraining to teach conversation format
  5. Fine-tune with supervised learning (SFT)
  6. Generate evaluation reports

3.2 Training Duration

Expected training time on station:

  • Speedrun (d20): Upto 16 hours for 561M parameter model

The training uses PyTorch with:

  • Model Architecture: GPT-style Transformer with 20 layers
  • Parameters: 561M
  • Training Tokens: ~11.2B tokens (Chinchilla optimal)
  • Optimizer: Muon for pretraining, AdamW for finetuning
  • Precision: Mixed precision (bfloat16)

3.3 Monitoring Training

To view the training progress via W&B, monitor any stage of nanochat training at:

https://wandb.ai/<your-username>/projects

Track key metrics:

  • Training loss: Should decrease from ~3.5 to ~2.5
  • Validation loss: Monitor for overfitting
  • Learning rate: Follows cosine decay schedule
  • Throughput: Tokens processed per second

3.4 Training Stages

The training pipeline consists of:

Stage 1: Tokenizer Training

  • Downloads 2B characters from FineWeb dataset
  • Trains BPE tokenizer with 65,536 vocabulary size
  • Achieves ~4.8 characters per token compression

Stage 2: Base Model Pretraining

  • Downloads 240 data shards (~24GB) from FineWeb
  • Pretrains d20 model (561M params) on 11.2B tokens
  • Evaluates on CORE benchmark (DCLM paper metrics)

Stage 3: Midtraining

  • Introduces conversation special tokens (<|im_start|>, <|im_end|>)
  • Trains on synthetic identity conversations
  • Teaches model chat format and basic personality

Stage 4: Supervised Fine-tuning (SFT)

  • Fine-tunes on SmolTalk dataset
  • Improves conversation quality and instruction following
  • Final model ready for chat inference

3.5 Checkpoints

Training checkpoints are automatically saved in ~/.cache/nanochat/:

  • model_base.pt: Pretrained base model
  • model_mid.pt: After midtraining
  • model_sft.pt: Final fine-tuned model
  • tokenizer.model: Trained BPE tokenizer

4. Customization

For faster experimentation or testing the distributed setup, you can train a smaller model. This is useful for validating your infrastructure and workflow before committing to the full 5-day training run.

4.1 Remove existing nanochat installation

If you have previously run the setup, first remove the nanochat folder:

# From the assets directory
rm -rf nanochat

4.2 Modify speedrun_station.sh for minimal configuration

Edit speedrun_station.sh to use a smaller model configuration:

# Reduce data shards (50 shards instead of 240 for quick testing)
python -m nanochat.dataset -n 50 &

# Use depth=4 for minimal training
python -m scripts.base_train --depth=4 --device_batch_size=32

5. Inference

After training completes, you can interact with your trained model through multiple interfaces.

Launch the ChatGPT-style web interface:


# Activate the virtual environment
cd nanochat
source ../.venv/bin/activate

# Start the web server
python -m scripts.chat_web

Access the UI at:

http://<SYSTEM_IP>:8000

The web UI provides:

  • ChatGPT-style conversation interface
  • Message history
  • Real-time streaming responses
  • Clean, modern design

5.2 Command Line Interface

For quick interactions via terminal:

# Interactive chat mode
python -m scripts.chat_cli

# Single prompt mode
python -m scripts.chat_cli -p "Why is the sky blue?"

# Specify checkpoint (base, mid, or sft)
python -m scripts.chat_cli -i sft -p "Write me a haiku about distributed training"

5.3 Sample Prompts

Try these prompts to test your model:

Reasoning:

Why do astronauts float in space?

Math:

A model trains for 3 epochs. Each epoch has 1000 steps and each step takes 0.5 seconds. How many minutes does the full training take?

Code:

Write a Python function to calculate fibonacci numbers

Note: The d20 speedrun model (561M parameters, ~4e19 FLOPs) is intentionally designed as a compact educational demonstration and has significant limitations. Expected behaviors include factual inaccuracies, hallucinations, and inconsistent reasoning. These characteristics are inherent to models trained with limited parameters and compute resources.

6. Evaluation Results

6.1 Training Report

After training completes, a comprehensive report is generated at nanochat/report.md. View it with:

cat nanochat/report.md

The report includes:

  • System information and training configuration
  • Training curves and loss plots
  • Evaluation metrics across all benchmarks
  • Sample generations at each training stage
  • Total training time and cost breakdown

7. Architecture Details

7.1 Model Configuration (d20)

Layers: 20
Embedding Dimension: 1024
Attention Heads: 16
Context Length: 1024 tokens
Vocabulary Size: 65,536
Total Parameters: 561M

4.3 Re-run setup script

After modifying speedrun_station.sh, run the setup script again to deploy the changes to both nodes:


# Run setup to clone nanochat and build Docker images
sh setup.sh 

Then proceed with Section 3 (Training) to launch training.

Troubleshooting

Common Issues

Issue: Out of memory (OOM) errors

RuntimeError: CUDA out of memory

Solution: Reduce batch size in the training scripts:

--device_batch_size=16  # or 8, 4, 2, 1

Issue: Docker container not starting Solution:

  • Check GPU availability: nvidia-smi
  • Ensure no other containers using GPUs: docker ps
  • Verify Docker has GPU access: docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi

Cleanup

Stop Training

To stop training early, interrupt both containers:

# From the terminal running launch.sh
Ctrl+C

# Or manually stop containers
docker stop nanochat

Clear Cache

To free up disk space after training:

# On both nodes
rm -rf ~/.cache/nanochat

# Clear Docker system
docker system prune -a

Credits

This project is built upon:

  • nanochat by Andrej Karpathy - The base LLM training framework
  • nanoGPT by Andrej Karpathy - Inspiration for minimal LLM training
  • modded-nanoGPT by Keller Jordan - Optimized training techniques
  • FineWeb by HuggingFace - Pretraining dataset
  • SmolTalk by HuggingFace - Finetuning dataset

License

MIT - See nanochat repository for full license details.


Note: This is an educational project demonstrating distributed LLM training. The resulting model is a micro-model suitable for learning but not production use. For state-of-the-art performance, consider using pre-trained models like GPT-4, Claude, or Llama.