mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git synced 2026-06-20 13:19:34 +00:00

Nanochat Training on DGX Station

This project demonstrates training of nanochat, "the best ChatGPT that $100 can buy," on DGX Station. The demo includes tokenization, pretraining, midtraining, supervised fine-tuning (SFT), and inference through both CLI and web UI.

Overview

The project includes:

Full LLM Pipeline: Tokenizer training, pretraining, midtraining, and SFT
Custom Tokenizer: BPE tokenizer with 65K vocabulary trained on FineWeb
Evaluation Suite: CORE, ARC, GSM8K, HumanEval, MMLU benchmarks
Interactive Inference: Chat with your model via CLI or web UI
Docker Support: Complete containerized environment with PyTorch NGC

Environment Setup
Preparation
Training
Customization
Inference
Evaluation Results
Architecture Details

1. Environment Setup

1.1 Prerequisites

Before starting, ensure you have:

DGX Station with the NVIDIA driver and CUDA toolkit installed
Docker installed on the system
Hugging Face and Weights & Biases (W&B) API access

1.2 Environment Setup

For training visualization and logging, set up your W&B API key. If you don't have a W&B account, you can create one at wandb.ai. A Hugging Face token is also required to download certain datasets used for model evaluation. You can create one by following the instructions at https://huggingface.co/docs/hub/en/security-tokens.

export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
export HF_TOKEN=<YOUR_HF_TOKEN>
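
Before launching a long run, it can help to confirm that both variables are visible to the current shell. The following is a minimal sketch; the helper name `missing_env_vars` is hypothetical and not part of nanochat.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_VARS = ("WANDB_API_KEY", "HF_TOKEN")


def missing_env_vars(env=os.environ, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```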

2. Preparation

2.1 Clone the repository

Clone this repository and change into the station-nanochat directory.

cd station-nanochat

2.2 Nanochat Setup

Navigate to the assets folder and run the setup script to clone nanochat and build the Docker image.

sh setup.sh

The setup script will:

Clone the nanochat repository
Copy the modified speedrun_station.sh script for training on station
Build a custom Docker image for nanochat

Verify your directory structure after setup:

station-nanochat/assets/
├── Dockerfile
├── launch.sh
├── setup.sh
├── speedrun_station.sh
└── nanochat/
    ├── README.md
    ├── speedrun.sh (replaced with speedrun_station.sh)
    ├── scripts/
    ├── nanochat/
    └── ...

2.3 Verify Docker Image

Ensure the Docker image was built successfully:

# On your system
docker images | grep nanochat

You should see the nanochat image listed on your system.
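
The same check can be scripted, for example as part of a preflight step. This is a sketch only: it assumes the image name contains "nanochat" (as the grep above implies) and that the `docker` CLI is on the PATH. The function names are hypothetical.

```python
import subprocess


def find_images(listing: str, keyword: str = "nanochat"):
    """Return the lines of an image listing that mention the keyword."""
    return [line for line in listing.splitlines() if keyword in line]


def list_docker_images() -> str:
    """Ask the Docker CLI for local images as repository:tag lines."""
    result = subprocess.run(
        ["docker", "images", "--format", "{{.Repository}}:{{.Tag}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    matches = find_images(list_docker_images())
    print(matches if matches else "No nanochat image found; re-run setup.sh")
```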

3. Training

3.1 Launch Training

Start training on DGX Station:

# Ensure that the previous environment variables are exported
# Launch training
./launch.sh

The training script will automatically:

Download ~24GB of FineWeb pretraining data
Train a BPE tokenizer with 65K vocabulary
Pretrain a 561M parameter Transformer model (d20)
Run midtraining to teach conversation format
Fine-tune with supervised learning (SFT)
Generate evaluation reports

3.2 Training Duration

Expected training time on station:

Speedrun (d20): Up to 16 hours for the 561M parameter model

The training uses PyTorch with:

Model Architecture: GPT-style Transformer with 20 layers
Parameters: 561M
Training Tokens: ~11.2B tokens (Chinchilla optimal)
Optimizer: Muon for pretraining, AdamW for finetuning
Precision: Mixed precision (bfloat16)
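
The token budget can be cross-checked with two common rules of thumb: roughly 20 training tokens per parameter for a Chinchilla-style budget, and roughly 6 FLOPs per parameter per token for training compute. Both constants are general approximations assumed here, not values read from the nanochat code.

```python
def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal token count (about 20 tokens per parameter)."""
    return n_params * tokens_per_param


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute using the 6 * N * D rule of thumb."""
    return 6.0 * n_params * n_tokens


params = 561e6  # d20 parameter count from the list above
tokens = chinchilla_tokens(params)
flops = training_flops(params, tokens)

print(f"tokens: {tokens / 1e9:.2f}B")  # tokens: 11.22B
print(f"FLOPs:  {flops:.2e}")          # FLOPs:  3.78e+19
```

Under these assumptions the result lines up with the ~11.2B training tokens listed above and with the ~4e19 FLOPs figure quoted in the inference note.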

3.3 Monitoring Training

To view the training progress via W&B, monitor any stage of nanochat training at:

https://wandb.ai/<your-username>/projects

Track key metrics:

Training loss: Should decrease from ~3.5 to ~2.5
Validation loss: Monitor for overfitting
Learning rate: Follows cosine decay schedule
Throughput: Tokens processed per second
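
For reference when reading the learning-rate curve, a generic cosine decay schedule can be sketched as follows. This is the textbook formula, not code taken from nanochat; `peak_lr`, `min_lr`, and the step counts are hypothetical values chosen for illustration.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 down to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Hypothetical schedule: 1000 steps, peak learning rate 0.02.
    for step in (0, 250, 500, 750, 1000):
        print(step, round(cosine_lr(step, 1000, 0.02), 5))
```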

3.4 Training Stages

The training pipeline consists of:

Stage 1: Tokenizer Training

Downloads 2B characters from FineWeb dataset
Trains BPE tokenizer with 65,536 vocabulary size
Achieves ~4.8 characters per token compression
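
The compression figure is simply characters divided by tokens. The sketch below shows that ratio and uses it to estimate how much raw text the 11.2B-token pretraining budget corresponds to; the helper name and the toy token list are hypothetical.

```python
def chars_per_token(text: str, token_ids) -> float:
    """Compression ratio of a tokenizer on a given text."""
    if not token_ids:
        raise ValueError("token_ids must not be empty")
    return len(text) / len(token_ids)


# Toy example: a 12-character string split into 3 made-up tokens.
print(chars_per_token("hello world!", [101, 102, 103]))  # 4.0

# Rough text volume for the pretraining budget at ~4.8 chars/token.
pretraining_tokens = 11.2e9
estimated_chars = pretraining_tokens * 4.8
print(f"about {estimated_chars / 1e9:.1f}B characters")  # about 53.8B characters
```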

Stage 2: Base Model Pretraining

Downloads 240 data shards (~24GB) from FineWeb
Pretrains d20 model (561M params) on 11.2B tokens
Evaluates on CORE benchmark (DCLM paper metrics)

Stage 3: Midtraining

Introduces conversation special tokens (<|im_start|>, <|im_end|>)
Trains on synthetic identity conversations
Teaches model chat format and basic personality

Stage 4: Supervised Fine-tuning (SFT)

Fine-tunes on SmolTalk dataset
Improves conversation quality and instruction following
Final model ready for chat inference

3.5 Checkpoints

Training checkpoints are automatically saved in ~/.cache/nanochat/:

model_base.pt: Pretrained base model
model_mid.pt: After midtraining
model_sft.pt: Final fine-tuned model
tokenizer.model: Trained BPE tokenizer
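
To see which stages have produced output so far, the cache directory can be inspected with a short script. This sketch assumes the files sit directly in ~/.cache/nanochat/ under the names listed above; the actual layout may nest them in subdirectories.

```python
from pathlib import Path

# File names as listed above.
EXPECTED_FILES = ("model_base.pt", "model_mid.pt", "model_sft.pt", "tokenizer.model")


def checkpoint_status(cache_dir="~/.cache/nanochat"):
    """Map each expected file name to True if it exists in cache_dir."""
    root = Path(cache_dir).expanduser()
    return {name: (root / name).is_file() for name in EXPECTED_FILES}


if __name__ == "__main__":
    for name, present in checkpoint_status().items():
        print(f"{name}: {'found' if present else 'missing'}")
```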

4. Customization

For faster experimentation or for testing your setup, you can train a smaller model. This is useful for validating your infrastructure and workflow before committing to the full training run (up to 16 hours).

4.1 Remove existing nanochat installation

If you have previously run the setup, first remove the nanochat folder:

# From the assets directory
rm -rf nanochat

4.2 Modify speedrun_station.sh for minimal configuration

Edit speedrun_station.sh to use a smaller model configuration:

# Reduce data shards (50 shards instead of 240 for quick testing)
python -m nanochat.dataset -n 50 &

# Use depth=4 for minimal training
python -m scripts.base_train --depth=4 --device_batch_size=32

5. Inference

After training completes, you can interact with your trained model through multiple interfaces.

5.1 Web UI (Recommended)

Launch the ChatGPT-style web interface:


# Activate the virtual environment
cd nanochat
source ../.venv/bin/activate

# Start the web server
python -m scripts.chat_web

Access the UI at:

http://<SYSTEM_IP>:8000

The web UI provides:

ChatGPT-style conversation interface
Message history
Real-time streaming responses
Clean, modern design
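
If the page does not load, a quick way to tell a network problem from a server problem is to test whether the port accepts TCP connections. A minimal sketch, assuming the server listens on port 8000 as shown above; the host value is a placeholder.

```python
import socket


def port_is_open(host: str, port: int = 8000, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    host = "127.0.0.1"  # placeholder; replace with your <SYSTEM_IP>
    print("reachable" if port_is_open(host) else "not reachable")
```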

5.2 Command Line Interface

For quick interactions via terminal:

# Interactive chat mode
python -m scripts.chat_cli

# Single prompt mode
python -m scripts.chat_cli -p "Why is the sky blue?"

# Specify checkpoint (base, mid, or sft)
python -m scripts.chat_cli -i sft -p "Write me a haiku about distributed training"

5.3 Sample Prompts

Try these prompts to test your model:

Reasoning:

Why do astronauts float in space?

Math:

A model trains for 3 epochs. Each epoch has 1000 steps and each step takes 0.5 seconds. How many minutes does the full training take?

Code:

Write a Python function to calculate fibonacci numbers
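
When judging the model's reply to the math prompt, the expected answer can be worked out directly:

```python
epochs = 3
steps_per_epoch = 1000
seconds_per_step = 0.5

total_seconds = epochs * steps_per_epoch * seconds_per_step
total_minutes = total_seconds / 60

print(total_minutes)  # 25.0
```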

Note: The d20 speedrun model (561M parameters, ~4e19 FLOPs) is intentionally designed as a compact educational demonstration and has significant limitations. Expected behaviors include factual inaccuracies, hallucinations, and inconsistent reasoning. These characteristics are inherent to models trained with limited parameters and compute resources.

6. Evaluation Results

6.1 Training Report

After training completes, a comprehensive report is generated at nanochat/report.md. View it with:

cat nanochat/report.md

The report includes:

System information and training configuration
Training curves and loss plots
Evaluation metrics across all benchmarks
Sample generations at each training stage
Total training time and cost breakdown

7. Architecture Details

7.1 Model Configuration (d20)

Layers: 20
Embedding Dimension: 1280
Attention Heads: 10
Context Length: 2048 tokens
Vocabulary Size: 65,536
Total Parameters: 561M
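
As a rough cross-check of the parameter count, a common approximation for a GPT-style Transformer is 12 * layers * width^2 for the blocks plus the embedding tables. The sketch below assumes untied input and output embeddings and ignores small terms such as norms; these are assumptions, not details stated in this document.

```python
def approx_params(n_layers: int, d_model: int, vocab_size: int) -> int:
    """Rough parameter count: Transformer blocks plus untied embeddings."""
    blocks = 12 * n_layers * d_model ** 2   # attention (4*d^2) + MLP (8*d^2) per layer
    embeddings = 2 * vocab_size * d_model   # token embedding + output projection
    return blocks + embeddings


print(approx_params(20, 1280, 65536))  # 560988160
print(approx_params(20, 1024, 65536))  # 385875968
```

Under these assumptions, a width of 1280 reproduces the 561M figure for 20 layers and a 65,536-token vocabulary, whereas a width of 1024 gives about 386M.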

4.3 Re-run setup script

After modifying speedrun_station.sh, run the setup script again to apply the changes:


# Run setup to clone nanochat and build Docker images
sh setup.sh

Then proceed with Section 3 (Training) to launch training.

Troubleshooting

Common Issues

Issue: Out of memory (OOM) errors

RuntimeError: CUDA out of memory

Solution: Reduce batch size in the training scripts:

--device_batch_size=16  # or 8, 4, 2, 1
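
If the training script keeps the effective batch size fixed through gradient accumulation (an assumption here, not something this document states), lowering the per-device batch size trades memory for more micro-steps per optimizer update. The numbers below are illustrative only.

```python
def grad_accum_steps(total_batch_tokens: int, device_batch_size: int,
                     seq_len: int, num_gpus: int = 1) -> int:
    """Micro-steps needed per optimizer step to reach total_batch_tokens."""
    tokens_per_micro_step = device_batch_size * seq_len * num_gpus
    if total_batch_tokens % tokens_per_micro_step != 0:
        raise ValueError("total batch must be a multiple of the micro-batch size")
    return total_batch_tokens // tokens_per_micro_step


# Hypothetical settings: 524,288 tokens per optimizer step, 2048-token sequences.
for batch_size in (32, 16, 8):
    print(batch_size, grad_accum_steps(524_288, batch_size, 2048))
```

With these hypothetical settings, batch sizes of 32, 16, and 8 need 8, 16, and 32 accumulation steps respectively, so each halving roughly halves activation memory while doubling the micro-steps per update.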

Issue: Docker container not starting

Solution:

Check GPU availability: nvidia-smi
Ensure no other containers are using GPUs: docker ps
Verify Docker has GPU access: docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi

Cleanup

Stop Training

To stop training early, interrupt the container:

# From the terminal running launch.sh
Ctrl+C

# Or manually stop containers
docker stop nanochat

Clear Cache

To free up disk space after training:

# On your system
rm -rf ~/.cache/nanochat

# Clear Docker system
docker system prune -a

Credits

This project is built upon:

nanochat by Andrej Karpathy - The base LLM training framework
nanoGPT by Andrej Karpathy - Inspiration for minimal LLM training
modded-nanoGPT by Keller Jordan - Optimized training techniques
FineWeb by HuggingFace - Pretraining dataset
SmolTalk by HuggingFace - Finetuning dataset

License

MIT - See nanochat repository for full license details.

Note: This is an educational project demonstrating end-to-end LLM training. The resulting model is a micro-model suitable for learning but not for production use. For state-of-the-art performance, consider using pre-trained models like GPT-4, Claude, or Llama.