mirror of
https://github.com/NVIDIA/dgx-spark-playbooks.git
synced 2026-06-21 21:59:30 +00:00
# Nanochat Training on DGX Station

This project demonstrates training of [nanochat](https://github.com/karpathy/nanochat), "the best ChatGPT that $100 can buy," on DGX Station. The demo includes tokenization, pretraining, midtraining, supervised fine-tuning (SFT), and inference through both CLI and web UI.

## Overview

The project includes:
- **Full LLM Pipeline**: Tokenizer training, pretraining, midtraining, and SFT
- **Custom Tokenizer**: BPE tokenizer with 65K vocabulary trained on FineWeb
- **Evaluation Suite**: CORE, ARC, GSM8K, HumanEval, MMLU benchmarks
- **Interactive Inference**: Chat with your model via CLI or web UI
- **Docker Support**: Complete containerized environment with PyTorch NGC

## Contents

1. [Environment Setup](#1-environment-setup)
2. [Preparation](#2-preparation)
3. [Training](#3-training)
4. [Customization](#4-customization)
5. [Inference](#5-inference)
6. [Evaluation Results](#6-evaluation-results)
7. [Architecture Details](#7-architecture-details)

## 1. Environment Setup

### 1.1 Prerequisites

Before starting, ensure you have:
- A DGX Station with the NVIDIA driver and CUDA toolkit set up
- Docker installed on the system
- Hugging Face and W&B API access

### 1.2 Environment Setup

For training visualization and logging, set up your W&B API key. If you don't have a W&B account, you can create one at [wandb.ai](https://wandb.ai/). A Hugging Face token is also required to download certain datasets used for model evaluation; you can create one by following the instructions at [huggingface.co](https://huggingface.co/docs/hub/en/security-tokens).

```bash
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
export HF_TOKEN=<YOUR_HF_TOKEN>
```

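
Before launching a long run, it can help to confirm that both variables are actually visible to the process. The following is a minimal sketch of such a check, not part of the playbook itself; the helper name `missing_env_vars` is hypothetical.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_VARS = ("WANDB_API_KEY", "HF_TOKEN")


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Running this in the same shell that will launch training shows whether the exports took effect.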
## 2. Preparation

### 2.1 Clone the repository

Clone the current repository, then change into the `station-nanochat` directory.

```bash
cd station-nanochat
```

### 2.2 Nanochat Setup

Navigate to the assets folder and run the setup script to clone nanochat and build the Docker image on your system.

```bash
sh setup.sh
```

The setup script will:
- Clone the nanochat repository
- Copy the modified `speedrun_station.sh` script for training on DGX Station
- Build a custom Docker image for nanochat

Verify your directory structure after setup:

```
station-nanochat/assets/
├── Dockerfile
├── launch.sh
├── setup.sh
├── speedrun_station.sh
└── nanochat/
    ├── README.md
    ├── speedrun.sh (replaced with speedrun_station.sh)
    ├── scripts/
    ├── nanochat/
    └── ...
```

### 2.3 Verify Docker Image

Ensure the Docker image was built successfully:

```bash
# On your system
docker images | grep nanochat
```

You should see the `nanochat` image listed on your system.

## 3. Training

### 3.1 Launch Training

Start training on DGX Station:

```bash
# Ensure that the previous environment variables are exported
# Launch training
./launch.sh
```

The training script will automatically:
1. Download ~24GB of FineWeb pretraining data
2. Train a BPE tokenizer with 65K vocabulary
3. Pretrain a 561M parameter Transformer model (d20)
4. Run midtraining to teach conversation format
5. Fine-tune with supervised learning (SFT)
6. Generate evaluation reports

### 3.2 Training Duration

Expected training time on DGX Station:
- **Speedrun (d20)**: Up to 16 hours for the 561M parameter model

The training uses PyTorch with:
- **Model Architecture**: GPT-style Transformer with 20 layers
- **Parameters**: 561M
- **Training Tokens**: ~11.2B tokens (Chinchilla optimal)
- **Optimizer**: Muon for pretraining, AdamW for finetuning
- **Precision**: Mixed precision (bfloat16)

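
The token budget and the time bound above can be cross-checked with a little arithmetic. The sketch below assumes the commonly cited Chinchilla rule of thumb of about 20 training tokens per parameter, which is an assumption not stated in this document but is consistent with its 561M and ~11.2B figures. It also assumes, as a simplification, that pretraining fills the whole 16 hours, so the throughput is only a lower bound.

```python
PARAMS = 561e6          # d20 parameter count, from the text
TOKENS_PER_PARAM = 20   # assumed Chinchilla-style rule of thumb
MAX_HOURS = 16          # upper bound on training time, from the text


def chinchilla_tokens(params, tokens_per_param=TOKENS_PER_PARAM):
    """Token budget implied by a fixed tokens-per-parameter ratio."""
    return params * tokens_per_param


def min_throughput(tokens, hours):
    """Tokens per second needed to process `tokens` within `hours`."""
    return tokens / (hours * 3600)


tokens = chinchilla_tokens(PARAMS)
print(f"Token budget: {tokens / 1e9:.2f}B tokens")
print(f"Throughput needed for {MAX_HOURS} h: "
      f"{min_throughput(tokens, MAX_HOURS):,.0f} tokens/s")
```

Under these assumptions the budget is about 11.2B tokens, and finishing pretraining inside 16 hours needs roughly 195K tokens per second sustained.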
### 3.3 Monitoring Training

To view the training progress via W&B, monitor any stage of nanochat training at:
```
https://wandb.ai/<your-username>/projects
```

Track key metrics:
- **Training loss**: Should decrease from ~3.5 to ~2.5
- **Validation loss**: Monitor for overfitting
- **Learning rate**: Follows cosine decay schedule
- **Throughput**: Tokens processed per second

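
To make the shape of the learning-rate curve concrete, here is a generic cosine decay schedule. This is a sketch of the textbook formula, not nanochat's scheduler code; the peak rate, step count, and function name are illustrative assumptions.

```python
import math


def cosine_lr(step, total_steps, peak_lr, min_lr=0.0):
    """Cosine-decay learning rate at `step`, clamped to [0, total_steps]."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical run: 1000 steps with a peak learning rate of 0.02.
for step in (0, 250, 500, 750, 1000):
    print(step, round(cosine_lr(step, total_steps=1000, peak_lr=0.02), 5))
```

The rate starts at the peak, passes through half the peak at the midpoint, and reaches the minimum at the final step, which is the curve to expect in the W&B learning-rate panel if the schedule is cosine.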
### 3.4 Training Stages

The training pipeline consists of:

#### Stage 1: Tokenizer Training
- Downloads 2B characters from FineWeb dataset
- Trains BPE tokenizer with 65,536 vocabulary size
- Achieves ~4.8 characters per token compression

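
Byte pair encoding and the characters-per-token figure can be illustrated with a toy trainer. The sketch below is a deliberately tiny byte-level BPE, not nanochat's tokenizer; the function names, corpus, and merge count are illustrative assumptions.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most common adjacent pair of token ids, or None."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges on the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for k in range(num_merges):
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        new_id = 256 + k  # ids 0..255 are reserved for raw bytes
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


text = "low lower lowest " * 4
ids, merges = train_bpe(text, num_merges=10)
print(f"{len(text)} characters -> {len(ids)} tokens "
      f"({len(text) / len(ids):.2f} characters per token)")
```

Each merge adds one vocabulary entry and shortens the sequence, so the characters-per-token ratio rises as the vocabulary grows. The real tokenizer applies the same idea at a much larger scale.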
#### Stage 2: Base Model Pretraining
- Downloads 240 data shards (~24GB) from FineWeb
- Pretrains d20 model (561M params) on 11.2B tokens
- Evaluates on CORE benchmark (DCLM paper metrics)

#### Stage 3: Midtraining
- Introduces conversation special tokens (`<|im_start|>`, `<|im_end|>`)
- Trains on synthetic identity conversations
- Teaches model chat format and basic personality

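
As an illustration of what "chat format" means, the sketch below wraps each turn of a conversation in the two special tokens named above. The exact layout (role name, newline placement) is an assumption modeled on common ChatML-style templates and may differ from what nanochat actually emits; `render_chat` is a hypothetical helper.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chat(messages):
    """Join {"role", "content"} messages into one delimited training string."""
    parts = []
    for message in messages:
        # Assumed layout: start token, role, newline, content, end token.
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    return "".join(parts)


conversation = [
    {"role": "user", "content": "Who are you?"},
    {"role": "assistant", "content": "I am nanochat, a small language model."},
]
print(render_chat(conversation))
```

Training on text delimited this way is what lets the model learn where one speaker stops and the next begins.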
#### Stage 4: Supervised Fine-tuning (SFT)
- Fine-tunes on SmolTalk dataset
- Improves conversation quality and instruction following
- Final model ready for chat inference

### 3.5 Checkpoints

Training checkpoints are automatically saved in `~/.cache/nanochat/`:
- `model_base.pt`: Pretrained base model
- `model_mid.pt`: After midtraining
- `model_sft.pt`: Final fine-tuned model
- `tokenizer.model`: Trained BPE tokenizer

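
A quick way to see how far a run has progressed is to check which of the files listed above exist. This is a small sketch that assumes the file names and cache location are exactly as listed; `checkpoint_status` is a hypothetical helper.

```python
from pathlib import Path

# File names as listed above; adjust if your run writes different names.
CHECKPOINT_FILES = ("model_base.pt", "model_mid.pt", "model_sft.pt", "tokenizer.model")


def checkpoint_status(cache_dir=None):
    """Map each expected file name to True if it exists under `cache_dir`."""
    if cache_dir is None:
        root = Path.home() / ".cache" / "nanochat"
    else:
        root = Path(cache_dir)
    return {name: (root / name).is_file() for name in CHECKPOINT_FILES}


for name, present in checkpoint_status().items():
    print(f"{name:16s} {'found' if present else 'missing'}")
```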
## 4. Customization

For faster experimentation or for testing your setup, you can train a smaller model. This is useful for validating your infrastructure and workflow before committing to the full training run.

### 4.1 Remove existing nanochat installation

If you have previously run the setup, first remove the nanochat folder:

```bash
# From the assets directory
rm -rf nanochat
```

### 4.2 Modify speedrun_station.sh for minimal configuration

Edit `speedrun_station.sh` to use a smaller model configuration:

```bash
# Reduce data shards (50 shards instead of 240 for quick testing)
python -m nanochat.dataset -n 50 &

# Use depth=4 for minimal training
python -m scripts.base_train --depth=4 --device_batch_size=32
```

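
To gauge how much data a reduced run downloads, the shard count can be scaled against the full-run figures quoted in this document (240 shards, about 24GB). The sketch assumes all shards are roughly the same size, which is an approximation.

```python
FULL_SHARDS = 240     # shards in the full run, from the text
FULL_SIZE_GB = 24.0   # approximate size of the full download, from the text


def estimated_download_gb(num_shards):
    """Scale the full download size linearly with the shard count."""
    return FULL_SIZE_GB * num_shards / FULL_SHARDS


print(f"50 shards: about {estimated_download_gb(50):.1f} GB")
```

Under that assumption, the 50-shard configuration above downloads about 5GB.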
## 5. Inference

After training completes, you can interact with your trained model through multiple interfaces.

### 5.1 Web UI (Recommended)

Launch the ChatGPT-style web interface:

```bash
# Activate the virtual environment
cd nanochat
source ../.venv/bin/activate

# Start the web server
python -m scripts.chat_web
```

Access the UI at:
```
http://<SYSTEM_IP>:8000
```

The web UI provides:
- ChatGPT-style conversation interface
- Message history
- Real-time streaming responses
- Clean, modern design

### 5.2 Command Line Interface

For quick interactions via terminal:

```bash
# Interactive chat mode
python -m scripts.chat_cli

# Single prompt mode
python -m scripts.chat_cli -p "Why is the sky blue?"

# Specify checkpoint (base, mid, or sft)
python -m scripts.chat_cli -i sft -p "Write me a haiku about distributed training"
```

### 5.3 Sample Prompts

Try these prompts to test your model:

**Reasoning:**
```
Why do astronauts float in space?
```

**Math:**
```
A model trains for 3 epochs. Each epoch has 1000 steps and each step takes 0.5 seconds. How many minutes does the full training take?
```

**Code:**
```
Write a Python function to calculate fibonacci numbers
```

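
When judging the model's replies to the math and code prompts, it helps to have reference answers on hand. The sketch below is one straightforward way to compute them; the function names are illustrative, and a correct model answer may of course be written differently.

```python
def training_minutes(epochs, steps_per_epoch, seconds_per_step):
    """Total training time in minutes for the math prompt."""
    return epochs * steps_per_epoch * seconds_per_step / 60


def fibonacci(n):
    """Return the n-th Fibonacci number, with fibonacci(0) == 0."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


print(training_minutes(3, 1000, 0.5))      # 25.0
print([fibonacci(i) for i in range(10)])   # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```

The math prompt works out to 3 × 1000 × 0.5 = 1500 seconds, or 25 minutes.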
**Note:** The d20 speedrun model (561M parameters, ~4e19 FLOPs) is intentionally designed as a compact educational demonstration and has significant limitations. Expected behaviors include factual inaccuracies, hallucinations, and inconsistent reasoning. These characteristics are inherent to models trained with limited parameters and compute resources.

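
The ~4e19 FLOPs figure can be sanity-checked with the widely used approximation that training a dense Transformer costs about 6 FLOPs per parameter per token. That approximation is an assumption brought in for this check, not something this document states; the parameter and token counts come from the sections above.

```python
PARAMS = 561e6    # from the text
TOKENS = 11.2e9   # from the text


def training_flops(params, tokens):
    """Approximate training compute using the 6 * N * D rule of thumb."""
    return 6 * params * tokens


print(f"{training_flops(PARAMS, TOKENS):.2e} FLOPs")
```

This gives about 3.8e19 FLOPs, in line with the rounded ~4e19 quoted in the note.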
## 6. Evaluation Results

### 6.1 Training Report

After training completes, a comprehensive report is generated at `nanochat/report.md`. View it with:

```bash
cat nanochat/report.md
```

The report includes:
- System information and training configuration
- Training curves and loss plots
- Evaluation metrics across all benchmarks
- Sample generations at each training stage
- Total training time and cost breakdown

## 7. Architecture Details

### 7.1 Model Configuration (d20)

```
Layers: 20
Embedding Dimension: 1280
Attention Heads: 10
Context Length: 1024 tokens
Vocabulary Size: 65,536
Total Parameters: 561M
```

### 4.3 Re-run setup script

After modifying `speedrun_station.sh`, run the setup script again to apply the changes:

```bash
# Run setup to clone nanochat and build the Docker image
sh setup.sh
```

Then proceed with Section 3 (Training) to launch training.

## Troubleshooting

### Common Issues

**Issue**: Out of memory (OOM) errors
```
RuntimeError: CUDA out of memory
```
**Solution**: Reduce batch size in the training scripts:
```bash
--device_batch_size=16 # or 8, 4, 2, 1
```

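
Lowering the per-device batch size does not have to change the effective batch size if the training script compensates with gradient accumulation. The sketch below shows that relationship under the assumption that the total batch size in tokens is held fixed. Whether the scripts here behave this way is not confirmed by this document, and the total of 524,288 tokens is a hypothetical value chosen for illustration.

```python
def grad_accum_steps(total_batch_tokens, device_batch_size, seq_len, num_gpus=1):
    """Micro-steps needed per optimizer step to reach the total batch size."""
    tokens_per_micro_step = device_batch_size * seq_len * num_gpus
    if total_batch_tokens % tokens_per_micro_step != 0:
        raise ValueError("total batch must be a multiple of tokens per micro-step")
    return total_batch_tokens // tokens_per_micro_step


TOTAL_BATCH_TOKENS = 524_288  # hypothetical total batch size in tokens
SEQ_LEN = 1024                # context length listed in section 7.1

for device_batch_size in (32, 16, 8, 4, 2, 1):
    steps = grad_accum_steps(TOTAL_BATCH_TOKENS, device_batch_size, SEQ_LEN)
    print(f"device_batch_size={device_batch_size:2d} -> {steps} accumulation steps")
```

In this model, each halving of the device batch size roughly halves activation memory per step while doubling the number of accumulation steps, so training gets slower but the optimization itself is unchanged.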
**Issue**: Docker container not starting
**Solution**:
- Check GPU availability: `nvidia-smi`
- Ensure no other containers are using the GPUs: `docker ps`
- Verify Docker has GPU access: `docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi`

## Cleanup

### Stop Training

To stop training early, interrupt the running container:

```bash
# Press Ctrl+C in the terminal running launch.sh,
# or stop the container manually:
docker stop nanochat
```

### Clear Cache

To free up disk space after training:

```bash
# Remove the nanochat cache, including saved checkpoints
rm -rf ~/.cache/nanochat

# Clear Docker system (removes all unused images, containers, and networks)
docker system prune -a
```

## Credits

This project is built upon:
- **[nanochat](https://github.com/karpathy/nanochat)** by Andrej Karpathy - The base LLM training framework
- **[nanoGPT](https://github.com/karpathy/nanoGPT)** by Andrej Karpathy - Inspiration for minimal LLM training
- **[modded-nanoGPT](https://github.com/KellerJordan/modded-nanogpt)** by Keller Jordan - Optimized training techniques
- **[FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb)** by Hugging Face - Pretraining dataset
- **[SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk)** by Hugging Face - Finetuning dataset

## License

MIT - See nanochat repository for full license details.

---

**Note**: This is an educational project demonstrating distributed LLM training. The resulting model is a micro-model suitable for learning but not production use. For state-of-the-art performance, consider using pre-trained models like GPT-4, Claude, or Llama.