# Nanochat Training on DGX Station

This project demonstrates training of [nanochat](https://github.com/karpathy/nanochat), "the best ChatGPT that $100 can buy," on DGX Station. The demo includes tokenization, pretraining, midtraining, supervised fine-tuning (SFT), and inference through both CLI and web UI.

## Overview

The project includes:
- **Full LLM Pipeline**: Tokenizer training, pretraining, midtraining, and SFT
- **Custom Tokenizer**: BPE tokenizer with 65K vocabulary trained on FineWeb
- **Evaluation Suite**: CORE, ARC, GSM8K, HumanEval, MMLU benchmarks
- **Interactive Inference**: Chat with your model via CLI or web UI
- **Docker Support**: Complete containerized environment with PyTorch NGC

## Contents
1. [Environment Setup](#1-environment-setup)
2. [Preparation](#2-preparation)
3. [Training](#3-training)
4. [Customization](#4-customization)
5. [Inference](#5-inference)
6. [Evaluation Results](#6-evaluation-results)
7. [Architecture Details](#7-architecture-details)

## 1. Environment Setup

### 1.1 Prerequisites

Before starting, ensure you have:
- DGX Station with driver and CUDA toolkit setup
- Docker installed on the system
- Huggingface and WandB API access

### 1.4 Enviornment Setup

For training visualization and logging, set up your W&B API key.  If you don't have a W&B account, you can create one at [wandb.ai](https://wandb.ai/). Additionally, a Huggingface token will be required for downloading certain datasets for model evaluation. Likewise, you can create a HF token by following the instructions at (huggingface.co](https://huggingface.co/docs/hub/en/security-tokens).

```bash
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
export HF_TOKEN=<YOUR_HF_TOKEN>
```

## 2. Preparation

### 2.1 Clone the repository 

Clone the current repository and change directories to the station-nanochat repository. 

```bash
cd station-nanochat
```

### 2.2 Nanochat Setup

After navigating to the assets folder and run the setup script to clone nanochat and build the Docker image on both nodes.

```bash
sh setup.sh
```

The setup script will:
- Clone the nanochat repository
- Copy the modified `speedrun_station.sh` script for training on station
- Build a custom Docker image for nanochat

Verify your directory structure after setup:

```
station-nanochat/assets/
├── Dockerfile
├── launch.sh
├── setup.sh
├── speedrun_station.sh
└── nanochat/
    ├── README.md
    ├── speedrun.sh (replaced with speedrun_station.sh)
    ├── scripts/
    ├── nanochat/
    └── ...
```

### 2.3 Verify Docker Image

Ensure the Docker image was built successfully on both nodes:

```bash
# On your system
docker images | grep nanochat
```

You should see the `nanochat` image listed on your system.

## 3. Training

### 3.1 Launch Training

Start training on DGX Station:

```bash
# Ensure that the previous environment variables are exported
# Launch training on both nodes
./ launch.sh 
```

The training script will automatically:
1. Download ~24GB of FineWeb pretraining data
2. Train a BPE tokenizer with 65K vocabulary
3. Pretrain a 561M parameter Transformer model (d20)
4. Run midtraining to teach conversation format
5. Fine-tune with supervised learning (SFT)
6. Generate evaluation reports

### 3.2 Training Duration

Expected training time on station:
- **Speedrun (d20)**: Upto 16 hours for 561M parameter model

The training uses PyTorch with:
- **Model Architecture**: GPT-style Transformer with 20 layers
- **Parameters**: 561M
- **Training Tokens**: ~11.2B tokens (Chinchilla optimal)
- **Optimizer**: Muon for pretraining, AdamW for finetuning
- **Precision**: Mixed precision (bfloat16)

### 3.3 Monitoring Training

To view the training progress via W&B, monitor any stage of nanochat training at:
```
https://wandb.ai/<your-username>/projects
```

Track key metrics:
- **Training loss**: Should decrease from ~3.5 to ~2.5
- **Validation loss**: Monitor for overfitting
- **Learning rate**: Follows cosine decay schedule
- **Throughput**: Tokens processed per second

### 3.4 Training Stages

The training pipeline consists of:

#### Stage 1: Tokenizer Training
- Downloads 2B characters from FineWeb dataset
- Trains BPE tokenizer with 65,536 vocabulary size
- Achieves ~4.8 characters per token compression

#### Stage 2: Base Model Pretraining
- Downloads 240 data shards (~24GB) from FineWeb
- Pretrains d20 model (561M params) on 11.2B tokens
- Evaluates on CORE benchmark (DCLM paper metrics)

#### Stage 3: Midtraining
- Introduces conversation special tokens (`<|im_start|>`, `<|im_end|>`)
- Trains on synthetic identity conversations
- Teaches model chat format and basic personality

#### Stage 4: Supervised Fine-tuning (SFT)
- Fine-tunes on SmolTalk dataset
- Improves conversation quality and instruction following
- Final model ready for chat inference

### 3.5 Checkpoints

Training checkpoints are automatically saved in `~/.cache/nanochat/`:
- `model_base.pt`: Pretrained base model
- `model_mid.pt`: After midtraining
- `model_sft.pt`: Final fine-tuned model
- `tokenizer.model`: Trained BPE tokenizer


## 4. Customization

For faster experimentation or testing the distributed setup, you can train a smaller model. This is useful for validating your infrastructure and workflow before committing to the full 5-day training run.

### 4.1 Remove existing nanochat installation

If you have previously run the setup, first remove the nanochat folder:

```bash
# From the assets directory
rm -rf nanochat
```

### 4.2 Modify speedrun_station.sh for minimal configuration

Edit `speedrun_station.sh` to use a smaller model configuration:

```bash
# Reduce data shards (50 shards instead of 240 for quick testing)
python -m nanochat.dataset -n 50 &

# Use depth=4 for minimal training
python -m scripts.base_train --depth=4 --device_batch_size=32
```

## 5. Inference

After training completes, you can interact with your trained model through multiple interfaces.

### 5.1 Web UI (Recommended)

Launch the ChatGPT-style web interface:

```bash

# Activate the virtual environment
cd nanochat
source ../.venv/bin/activate

# Start the web server
python -m scripts.chat_web
```

Access the UI at:
```
http://<SYSTEM_IP>:8000
```

The web UI provides:
- ChatGPT-style conversation interface
- Message history
- Real-time streaming responses
- Clean, modern design

### 5.2 Command Line Interface

For quick interactions via terminal:

```bash
# Interactive chat mode
python -m scripts.chat_cli

# Single prompt mode
python -m scripts.chat_cli -p "Why is the sky blue?"

# Specify checkpoint (base, mid, or sft)
python -m scripts.chat_cli -i sft -p "Write me a haiku about distributed training"
```

### 5.3 Sample Prompts

Try these prompts to test your model:

**Reasoning:**
```
Why do astronauts float in space?
```

**Math:**
```
A model trains for 3 epochs. Each epoch has 1000 steps and each step takes 0.5 seconds. How many minutes does the full training take?
```

**Code:**
```
Write a Python function to calculate fibonacci numbers
```

**Note:** The d20 speedrun model (561M parameters, ~4e19 FLOPs) is intentionally designed as a compact educational demonstration and has significant limitations. Expected behaviors include factual inaccuracies, hallucinations, and inconsistent reasoning. These characteristics are inherent to models trained with limited parameters and compute resources.

## 6. Evaluation Results

### 6.1 Training Report

After training completes, a comprehensive report is generated at `nanochat/report.md`. View it with:

```bash
cat nanochat/report.md
```

The report includes:
- System information and training configuration
- Training curves and loss plots
- Evaluation metrics across all benchmarks
- Sample generations at each training stage
- Total training time and cost breakdown


## 7. Architecture Details

### 7.1 Model Configuration (d20)

```
Layers: 20
Embedding Dimension: 1024
Attention Heads: 16
Context Length: 1024 tokens
Vocabulary Size: 65,536
Total Parameters: 561M
```

### 4.3 Re-run setup script

After modifying `speedrun_station.sh`, run the setup script again to deploy the changes to both nodes:

```bash

# Run setup to clone nanochat and build Docker images
sh setup.sh 
```

Then proceed with Section 3 (Training) to launch training.

## Troubleshooting

### Common Issues

**Issue**: Out of memory (OOM) errors
```
RuntimeError: CUDA out of memory
```
**Solution**: Reduce batch size in the training scripts:
```bash
--device_batch_size=16  # or 8, 4, 2, 1
```

**Issue**: Docker container not starting
**Solution**: 
- Check GPU availability: `nvidia-smi`
- Ensure no other containers using GPUs: `docker ps`
- Verify Docker has GPU access: `docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi`

## Cleanup

### Stop Training

To stop training early, interrupt both containers:

```bash
# From the terminal running launch.sh
Ctrl+C

# Or manually stop containers
docker stop nanochat
```

### Clear Cache

To free up disk space after training:

```bash
# On both nodes
rm -rf ~/.cache/nanochat

# Clear Docker system
docker system prune -a
```

## Credits

This project is built upon:
- **[nanochat](https://github.com/karpathy/nanochat)** by Andrej Karpathy - The base LLM training framework
- **[nanoGPT](https://github.com/karpathy/nanoGPT)** by Andrej Karpathy - Inspiration for minimal LLM training
- **[modded-nanoGPT](https://github.com/KellerJordan/modded-nanogpt)** by Keller Jordan - Optimized training techniques
- **[FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb)** by HuggingFace - Pretraining dataset
- **[SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk)** by HuggingFace - Finetuning dataset

## License

MIT - See nanochat repository for full license details.

---

**Note**: This is an educational project demonstrating distributed LLM training. The resulting model is a micro-model suitable for learning but not production use. For state-of-the-art performance, consider using pre-trained models like GPT-4, Claude, or Llama.