mirror of
https://github.com/NVIDIA/dgx-spark-playbooks.git
synced 2026-06-18 04:22:21 +00:00
chore: Regenerate all playbooks
parent
a9383bb067
commit
6942395d72
@@ -15,18 +15,16 @@
|
||||
|
||||
## Basic idea
|
||||
|
||||
This playbook demonstrates training of [nanochat](https://github.com/karpathy/nanochat) on DGX Station with the GB300 Ultra Superchip. You run the full pipeline on a single system: custom BPE tokenizer training, base model pretraining, midtraining (conversation format), supervised fine-tuning (SFT), and inference via CLI or web UI.
|
||||
This playbook demonstrates training of [nanochat](https://github.com/karpathy/nanochat) on DGX Station with the GB300 Ultra Superchip. You run the full pipeline on a single system: custom BPE tokenizer training, base model pretraining, supervised fine-tuning (SFT), and inference via CLI or web UI.
|
||||
|
||||
The project uses the PyTorch NGC container, FineWeb for pretraining, SmolTalk for SFT, and Weights & Biases for logging. The default speedrun configuration trains a 561M-parameter (d20) model suitable for learning and experimentation.
|
||||
The project uses the PyTorch NGC container, FineWeb for pretraining data, SmolTalk for SFT, and Weights & Biases for logging. The default configuration trains a ~1B parameter (d24) Transformer model with FP8 precision.
|
||||
|
||||
## What you'll accomplish
|
||||
|
||||
You will have a working nanochat setup that trains a small LLM and serves it for chat.
|
||||
|
||||
- **Environment:** Docker image with PyTorch and nanochat dependencies on your DGX Station.
|
||||
- **Training pipeline:** Tokenizer (65K BPE), pretraining (~11.2B tokens), midtraining, SFT, and automated report generation.
|
||||
- **Inference:** ChatGPT-style web UI and CLI to chat with the base, mid, or SFT checkpoints.
|
||||
- **Monitoring:** W&B dashboards and `nanochat/report.md` with metrics and samples.
|
||||
- **Environment:** Docker image with PyTorch NGC and nanochat dependencies on your DGX Station.
|
||||
- **Training pipeline:** BPE tokenizer (65K vocab), base model pretraining with FP8, SFT, and automated report generation.
|
||||
- **Inference:** ChatGPT-style web UI and CLI to chat with the base or SFT checkpoints.
|
||||
- **Monitoring:** W&B dashboards and `nanochat_cache/report/report.md` with metrics and samples.
|
||||
|
||||
## What to know before starting
|
||||
|
||||
@@ -38,104 +36,143 @@ You will have a working nanochat setup that trains a small LLM and serves it for
|
||||
|
||||
**Hardware:**
|
||||
|
||||
- NVIDIA DGX Station with GB300 Ultra Superchip.
|
||||
- Sufficient GPU memory for the chosen model (the GB300 Ultra provides ample memory for the d20 speedrun).
|
||||
- Adequate storage for cache (~24GB+ for FineWeb data and checkpoints).
|
||||
- NVIDIA DGX Station with GB300 Ultra Superchip (288GB VRAM).
|
||||
- Adequate storage for cache (~25GB+ for FineWeb data and checkpoints).
|
||||
|
||||
**Software:**
|
||||
|
||||
- Docker with NVIDIA Container Toolkit: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.01-py3 nvidia-smi`
|
||||
- Network access to download datasets (Hugging Face, FineWeb) and container images.
|
||||
- Docker with NVIDIA Container Toolkit: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi`
|
||||
- Network access to download datasets (Hugging Face, FineWeb) and container images (nvcr.io).
|
||||
- [Weights & Biases](https://wandb.ai/) account and API key.
|
||||
- [Hugging Face](https://huggingface.co/docs/hub/en/security-tokens) token for evaluation datasets.
|
||||
|
||||
## Model architecture (d24)
|
||||
|
||||
```
|
||||
Layers: 24
|
||||
Attention Heads: 12
|
||||
Head Dimension: 128
|
||||
Context Length: 2048 tokens
|
||||
Vocabulary Size: 65,536 (2^16, trained BPE)
|
||||
Precision: FP8 (e4m3, tensorwise scaling)
|
||||
```
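As a rough cross-check of the "~1B parameter" figure, the listing above can be turned into a parameter estimate. This is a minimal sketch, not nanochat's own accounting: it assumes a standard GPT block (attention plus a 4x MLP, no biases) and untied input and output embeddings, and the function name `estimate_params` is hypothetical.

```bash
# Rough parameter estimate for the d24 configuration listed above.
# Assumptions (not from the playbook): standard GPT block with a 4x MLP,
# no biases, untied input/output embeddings. The real model may differ.
estimate_params() {
  local layers=$1 heads=$2 head_dim=$3 vocab=$4
  local d_model=$(( heads * head_dim ))
  # Per layer: attention (4 * d^2) + MLP (8 * d^2) = 12 * d^2
  local block_params=$(( 12 * d_model * d_model * layers ))
  # Token embedding plus output projection
  local embed_params=$(( 2 * vocab * d_model ))
  echo $(( block_params + embed_params ))
}

estimate_params 24 12 128 65536   # prints 880803840
```

Under these assumptions the estimate is about 0.88B, the same order of magnitude as the quoted ~1B. Any gap would come from components this sketch leaves out.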
|
||||
|
||||
## Training stages
|
||||
|
||||
| Stage | Description |
|
||||
|-------|-------------|
|
||||
| Tokenizer | Trains BPE tokenizer (65K vocab) on ~2B characters from FineWeb |
|
||||
| Base pretraining | Pretrains d24 model on FineWeb with FP8, target data:param ratio of 8 |
|
||||
| SFT | Fine-tunes on synthetic identity conversations + SmolTalk |
|
||||
| Report | Generates `report.md` with metrics, samples, and system info |
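The data:param ratio in the table translates into a token budget and a FineWeb shard count. The sketch below assumes roughly 1B parameters, about 4.8 characters per token, and about 250M characters per shard (the last two figures come from comments in the speedrun script). Which parameter count the ratio is applied to is an assumption here, and `shards_needed` is a hypothetical helper.

```bash
# tokens = ratio * params
# shards = ceil(tokens * chars_per_token / chars_per_shard)
# chars_per_token is passed multiplied by 10 to stay in integer arithmetic.
shards_needed() {
  local params=$1 ratio=$2 chars_per_token_x10=$3 chars_per_shard=$4
  local tokens=$(( params * ratio ))
  local chars=$(( tokens * chars_per_token_x10 / 10 ))
  # Integer ceiling division
  echo $(( (chars + chars_per_shard - 1) / chars_per_shard ))
}

# ~1B params, ratio 8, 4.8 chars/token, 250M chars/shard
shards_needed 1000000000 8 48 250000000   # prints 154
```

That result is close to the "approximately 150 shards" mentioned in the script comments, which then adds 20 shards of padding.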
|
||||
|
||||
## Ancillary files
|
||||
|
||||
All required assets are in the playbook directory `nvidia/station-nanochat/assets` (see the [dgx-spark-playbooks](https://github.com/NVIDIA/dgx-spark-playbooks) repository).
|
||||
|
||||
- `assets/Dockerfile` – PyTorch NGC image plus nanochat dependencies and venv.
|
||||
- `assets/setup.sh` – Clones nanochat, checks out the supported commit, and builds the Docker image.
|
||||
- `assets/launch.sh` – Runs the training container on your DGX Station (runs the full pipeline: tokenizer, pretrain, midtrain, SFT, and report generation).
|
||||
- `assets/README.md` – Additional detail on training stages, inference, and troubleshooting.
|
||||
All required assets are in `nvidia/station-nanochat/assets/`:
|
||||
|
||||
- `Dockerfile` – PyTorch NGC image with nanochat pip dependencies.
|
||||
- `setup.sh` – Clones nanochat, checks out the supported commit, copies `speedrun_station.sh`, and builds the Docker image.
|
||||
- `launch.sh` – Runs the training container (full pipeline: tokenizer → pretrain → SFT → report).
|
||||
- `speedrun_station.sh` – Modified speedrun script adapted for single-GPU DGX Station.
|
||||
|
||||
## Time & risk
|
||||
|
||||
- **Estimated time:** About 30 minutes for clone, setup, and launching the run. Full d20 speedrun training time depends on your DGX Station configuration (hours to a day or more).
|
||||
- **Risk level:** Medium
|
||||
- Large downloads (FineWeb) can fail or be slow; ensure stable network and disk space.
|
||||
- API keys (W&B, HF) must be set or the launch script will exit.
|
||||
- **Rollback:** Stop containers with `docker stop`, remove caches under `~/.cache/nanochat` (or paths in `launch.sh`), and run `docker system prune -a` if needed.
|
||||
* **Last Updated:** 03/02/2026
|
||||
* First Publication
|
||||
- **Estimated time:** ~30 minutes for setup. Full d24 training takes on the order of 16+ hours on a single GB300 Ultra.
|
||||
- **Risk level:** Medium
|
||||
- Large downloads (FineWeb) can be slow; ensure stable network and disk space.
|
||||
- API keys (W&B, HF) must be set or `launch.sh` will exit immediately.
|
||||
- **Rollback:** Stop containers with `docker stop`, remove caches, and run `docker system prune -a` if needed.
|
||||
|
||||
## Credits
|
||||
|
||||
- [nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy
|
||||
- [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) by HuggingFace (pretraining data)
|
||||
- [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) by HuggingFace (SFT data)
|
||||
|
||||
## Instructions
|
||||
|
||||
## Step 1. Prerequisites and environment
|
||||
|
||||
This playbook is for **DGX Station** (single node). Ensure your DGX Station has Docker with NVIDIA runtime, GPU access, and required API keys. Nanochat uses Weights & Biases (W&B) for training visualization and a Hugging Face token for evaluation datasets.
|
||||
Ensure your DGX Station has Docker with NVIDIA runtime and GPU access. Nanochat uses Weights & Biases (W&B) for training visualization and a Hugging Face token for evaluation datasets.
|
||||
|
||||
```bash
|
||||
## Verify GPU and Docker
|
||||
nvidia-smi
|
||||
docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.01-py3 nvidia-smi
|
||||
docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi
|
||||
```
|
||||
|
||||
Expected output should show your GPU(s) and driver version. Create a [W&B account](https://wandb.ai/) and a [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens) if you do not have them.
|
||||
Create a [W&B account](https://wandb.ai/) and a [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens) if you don't have them. Export both keys in your shell:
|
||||
|
||||
```bash
|
||||
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
|
||||
export HF_TOKEN=<YOUR_HF_TOKEN>
|
||||
```
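Because `launch.sh` exits when either key is missing, it can help to confirm that both variables are exported before starting a long run. The following is a minimal sketch. `check_required_env` is a hypothetical helper, not part of the playbook, and it relies on `printenv`, which only sees exported variables.

```bash
# Report every required variable that is missing from the environment.
# Returns non-zero if at least one is unset, empty, or not exported.
check_required_env() {
  local missing=0 name
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      echo "$name is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   check_required_env WANDB_API_KEY HF_TOKEN && ./launch.sh
```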
|
||||
|
||||
## Step 2. Clone the playbook and set up nanochat
|
||||
## Step 2. Clone and set up
|
||||
|
||||
Clone the playbook repository and run the setup script to clone the nanochat repo and build the Docker image.
|
||||
Clone the playbook repository and navigate to the assets directory:
|
||||
|
||||
```bash
|
||||
git clone https://github.com/NVIDIA/dgx-spark-playbooks
|
||||
cd dgx-spark-playbooks/nvidia/station-nanochat/assets
|
||||
```
|
||||
|
||||
From the `assets` directory, run the setup script. It clones [nanochat](https://github.com/karpathy/nanochat), checks out the supported commit, and builds the `nanochat` Docker image (PyTorch NGC base with tiktoken, tokenizers, datasets, wandb, etc.).
|
||||
Run the setup script. It clones [nanochat](https://github.com/karpathy/nanochat), checks out the supported commit, copies the station-adapted `speedrun_station.sh`, and builds the `nanochat` Docker image (PyTorch NGC base with dependencies):
|
||||
|
||||
```bash
|
||||
./setup.sh
|
||||
```
|
||||
|
||||
Setup may take several minutes while the image builds. Verify the image:
|
||||
You should see the `nanochat` image listed if you run `docker images`. Your directory structure after setup should look like this:
|
||||
|
||||
```bash
|
||||
docker images | grep nanochat
|
||||
```
|
||||
assets/
|
||||
├── Dockerfile
|
||||
├── launch.sh
|
||||
├── setup.sh
|
||||
├── speedrun_station.sh
|
||||
└── nanochat/
|
||||
```
|
||||
|
||||
You should see the `nanochat` image listed.
|
||||
## Step 3. Launch training
|
||||
|
||||
## Step 3. Launch full training
|
||||
|
||||
> [!NOTE]
|
||||
> The default `launch.sh` uses cache directories under `/nanochat_cache`. If that path does not exist on your DGX Station, edit `launch.sh` and replace those paths with your own (e.g. `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`), and create the directories before running.
|
||||
|
||||
To run **full** training (d20 model, 240 shards, midtraining, SFT, report) for higher-quality results, use the full launcher. On a DGX Station with GB300 Ultra this can take on the order of 16 hours:
|
||||
Ensure your API keys are exported, then launch:
|
||||
|
||||
```bash
|
||||
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
|
||||
export HF_TOKEN=<YOUR_HF_TOKEN>
|
||||
./launch_full.sh
|
||||
./launch.sh
|
||||
```
|
||||
|
||||
This runs `speedrun_full.sh` inside the container: full FineWeb download (240 shards), 561M-parameter (d20) pretraining, midtraining, supervised fine-tuning, and report generation.
|
||||
The training runs inside the `nanochat` container and executes the full pipeline automatically:
|
||||
|
||||
## Step 4. Verify and use the model
|
||||
1. **Tokenizer training** — downloads ~2B characters from FineWeb, trains a 65K BPE tokenizer
|
||||
2. **Base model pretraining** — downloads additional FineWeb shards, pretrains a d24 model (~1B params) with FP8
|
||||
3. **SFT** — downloads synthetic identity conversations, fine-tunes for chat
|
||||
4. **Report generation** — produces `report.md` with metrics and samples
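The pretraining step uses FP8 with tensorwise scaling. One common way to implement that idea is to pick a single scale per tensor so that its largest absolute value maps to the top of the FP8 range. The sketch below shows only the scale computation, assuming the e4m3 variant whose largest finite value is 448. Whether nanochat computes its scales exactly this way is an assumption, and `fp8_tensor_scale` is a hypothetical helper.

```bash
# Read whitespace-separated values on stdin and print the tensorwise scale
# scale = 448 / max(|x|), where 448 is the largest finite e4m3 value.
# An all-zero tensor gets a scale of 1 to avoid dividing by zero.
fp8_tensor_scale() {
  awk 'BEGIN { amax = 0 }
       { for (i = 1; i <= NF; i++) {
           v = ($i < 0) ? -$i : $i
           if (v > amax) amax = v
         } }
       END { if (amax == 0) print 1; else print 448 / amax }'
}

echo "0.5 -3.5 2.0" | fp8_tensor_scale   # prints 128
```

Values are multiplied by this scale before the cast to FP8, and the inverse scale is kept to recover the original magnitude afterwards.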
|
||||
|
||||
After training completes, checkpoints and the tokenizer are under `~/.cache/nanochat/` (or the cache path used in `launch.sh`). Run inference from the nanochat directory (e.g. `assets/nanochat`) on your DGX Station.
|
||||
Training on a single GB300 Ultra takes on the order of 16+ hours for the full d24 run.
|
||||
|
||||
## Step 4. Monitor training
|
||||
|
||||
**W&B dashboard:**
|
||||
|
||||
Track training at [wandb.ai](https://wandb.ai/) under the `nanochat` project. The training logs print the exact link to the W&B run. Key metrics:
|
||||
- Training loss
|
||||
- Validation BPB
|
||||
- Throughput (tokens/sec)
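The throughput metric can be turned into a rough time-to-completion estimate. This is a minimal sketch with placeholder numbers: the token budget, progress, and throughput below are illustrative assumptions, not measured values, and `eta_hours` is a hypothetical helper.

```bash
# Remaining whole hours = (total_tokens - tokens_done) / tokens_per_sec / 3600
eta_hours() {
  local total_tokens=$1 tokens_done=$2 tokens_per_sec=$3
  if [ "$tokens_per_sec" -le 0 ]; then
    echo "tokens_per_sec must be positive" >&2
    return 1
  fi
  local remaining=$(( total_tokens - tokens_done ))
  echo $(( remaining / tokens_per_sec / 3600 ))
}

# Placeholder values: 8B-token budget, 2B tokens done, 140K tokens/sec
eta_hours 8000000000 2000000000 140000   # prints 11 (whole hours, rounded down)
```

Substitute the throughput shown in your own W&B dashboard for a more realistic figure.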
|
||||
|
||||
## Step 5. Inference
|
||||
|
||||
After training, checkpoints are saved under the `nanochat_cache/` directory. Run inference from inside the container or interactively:
|
||||
|
||||
**Web UI (recommended):**
|
||||
|
||||
```bash
|
||||
cd nanochat
|
||||
source ../.venv/bin/activate # if using venv from container context; otherwise use the container
|
||||
python -m scripts.chat_web
|
||||
docker run --rm --gpus all --net=host \
|
||||
-v $(pwd)/nanochat:/workspace/nanochat \
|
||||
-v $(pwd)/nanochat_cache:/root/.cache/nanochat \
|
||||
-w /workspace/nanochat \
|
||||
nanochat \
|
||||
python -m scripts.chat_web
|
||||
```
|
||||
|
||||
Open a browser to `http://<STATION_IP>:8000` where `<STATION_IP>` is your DGX Station’s IP address.
|
||||
@@ -143,14 +180,15 @@ Open a browser to `http://<STATION_IP>:8000` where `<STATION_IP>` is your DGX St
|
||||
**CLI:**
|
||||
|
||||
```bash
|
||||
cd nanochat
|
||||
python -m scripts.chat_cli -p "Why is the sky blue?"
|
||||
python -m scripts.chat_cli -i sft -p "Write a haiku about machine learning"
|
||||
docker run --rm -it --gpus all \
|
||||
-v $(pwd)/nanochat:/workspace/nanochat \
|
||||
-v $(pwd)/nanochat_cache:/root/.cache/nanochat \
|
||||
-w /workspace/nanochat \
|
||||
nanochat \
|
||||
python -m scripts.chat_cli -p "Why is the sky blue?"
|
||||
```
|
||||
|
||||
A full report is generated at `nanochat/report.md` after the run. You can also monitor training at [wandb.ai](https://wandb.ai/) under your project.
|
||||
|
||||
## Step 5. Cleanup
|
||||
## Step 6. Cleanup
|
||||
|
||||
To stop training early, interrupt the launch script or stop the container:
|
||||
|
||||
@@ -160,32 +198,43 @@ To stop training early, interrupt the launch script or stop the container:
|
||||
```bash
|
||||
## If launch.sh is running: press Ctrl+C
|
||||
|
||||
## Or stop the container by name
|
||||
## Or stop the container directly
|
||||
docker stop $(docker ps -q --filter ancestor=nanochat)
|
||||
```
|
||||
|
||||
To free disk space after training (use the same path as your cache if you set `NANOCHAT_CACHE`):
|
||||
To free disk space:
|
||||
|
||||
```bash
|
||||
rm -rf ./nanochat_cache ./hf_cache
|
||||
docker system prune -a
|
||||
```
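Before deleting anything, it can be useful to see how much space each cache directory actually occupies. The following is a minimal sketch using `du`. `cache_usage` is a hypothetical helper, and the directory names assume the default cache locations under the current directory.

```bash
# Print the disk usage of each directory that exists; note the ones that do not.
cache_usage() {
  local dir
  for dir in "$@"; do
    if [ -d "$dir" ]; then
      du -sh "$dir"
    else
      echo "skip: $dir (not found)"
    fi
  done
}

# Usage (from the assets directory):
#   cache_usage ./nanochat_cache ./hf_cache
```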
|
||||
|
||||
## Step 6. Next steps and customization
|
||||
## Step 7. Customization
|
||||
|
||||
- **Small scale run:** `./launch.sh` can run a lite training by following the customization guide to make changes to `speedrun_station.sh`. This can potentially bring down the training time.
|
||||
- **Custom cache paths:** Set `NANOCHAT_CACHE` and `HF_CACHE` before launching (e.g. `export NANOCHAT_CACHE=/path/to/nanochat_cache`) if you want cache outside the assets directory.
|
||||
- **Monitoring:** Use `nvidia-smi` and W&B dashboards to watch GPU utilization and training metrics (loss, throughput).
|
||||
- **Inference:** Try the web UI and CLI with different checkpoints (`base`, `mid`, `sft`) and prompts; see sample prompts in `assets/README.md`.
|
||||
**Smaller/faster run:** Edit `speedrun_station.sh` before running setup to reduce data and model size:
|
||||
|
||||
```bash
|
||||
## Fewer data shards (10 instead of default)
|
||||
python -m nanochat.dataset -n 10 &
|
||||
|
||||
## Smaller model (d4 instead of d24), smaller batch size
|
||||
python -m scripts.base_train --depth=4 --device-batch-size=32
|
||||
```
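If you prefer not to edit the script by hand, flag values can be rewritten with `sed`. This is a minimal sketch: `set_flag` is a hypothetical helper, it assumes flags are written in `--name=value` form as in the snippet above, and it rewrites every occurrence of the flag in its input.

```bash
# Rewrite --<flag>=<anything> to --<flag>=<value> on stdin, write to stdout.
set_flag() {
  local flag=$1 value=$2
  sed -E "s/--${flag}=[^[:space:]]+/--${flag}=${value}/g"
}

echo "python -m scripts.base_train --depth=24 --device-batch-size=64 --fp8" \
  | set_flag depth 4 \
  | set_flag device-batch-size 32
# prints: python -m scripts.base_train --depth=4 --device-batch-size=32 --fp8
```

To apply it to the real script, one option is `set_flag depth 4 < speedrun_station.sh > speedrun_station.tmp` followed by a review of the result before replacing the original.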
|
||||
|
||||
**Batch size:** The default `--device-batch-size=64` is tuned for the GB300's 288GB VRAM. Increase it if GPU utilization is low, or decrease it if training runs out of memory (OOM).
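Lowering the device batch size usually trades memory for more gradient accumulation steps when the total batch size is held constant. The sketch below illustrates that relationship for a single GPU. The total batch of 524,288 tokens is a placeholder, and whether the pinned nanochat commit derives accumulation steps this way is an assumption. `grad_accum_steps` is a hypothetical helper.

```bash
# accumulation steps = total_batch_tokens / (device_batch_size * seq_len)
grad_accum_steps() {
  local total_batch_tokens=$1 device_batch_size=$2 seq_len=$3
  local tokens_per_step=$(( device_batch_size * seq_len ))
  if [ $(( total_batch_tokens % tokens_per_step )) -ne 0 ]; then
    echo "total batch must be a multiple of $tokens_per_step tokens" >&2
    return 1
  fi
  echo $(( total_batch_tokens / tokens_per_step ))
}

# Placeholder total batch of 524288 tokens, context length 2048
for bs in 64 32 16 8; do
  echo "device-batch-size=$bs -> $(grad_accum_steps 524288 "$bs" 2048) accumulation steps"
done
```

With these placeholder numbers, halving the device batch size doubles the number of accumulation steps (4, 8, 16, 32).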
|
||||
|
||||
Then re-run `./setup.sh` to rebuild with the changes.
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
| Symptom | Cause | Fix |
|
||||
|--------|--------|-----|
|
||||
| `WANDB_API_KEY is not set` or `HF_TOKEN is not set` | Required env vars not exported before `launch.sh` | `export WANDB_API_KEY=<your_key>` and `export HF_TOKEN=<your_token>` in the same shell, then run `./launch.sh`. |
|
||||
| `RuntimeError: CUDA out of memory` | Batch size or model too large for GPU | In the training script in the cloned nanochat repo (e.g. `speedrun.sh`), reduce `--device_batch_size` (e.g. `16`, `8`, `4`, `2`, or `1`). |
|
||||
| Docker container not starting or no GPU | Docker or NVIDIA runtime misconfigured | Run `nvidia-smi` on your DGX Station. Check no other containers hold GPUs: `docker ps`. Test GPU in Docker: `docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi`. |
|
||||
| `Permission denied` or `No such file or directory` for cache paths in `launch.sh` | Paths like `/home/scratch.lramesh_dpt/...` don’t exist on your system | Edit `launch.sh`: set cache dirs to paths you can create (e.g. `$(pwd)/nanochat_cache`, `$(pwd)/hf_cache`). Run `mkdir -p <your_cache_dirs>` and re-run `launch.sh`. |
|
||||
| `nanochat` image not found when running `launch.sh` | Setup not run or build failed | From `nvidia/nanochat/assets`, run `./setup.sh` and confirm with `docker images` (look for the `nanochat` image). |
|
||||
| Training exits immediately or script doesn’t wait | Container fails early (missing keys, paths, or OOM) | Check container logs: `docker ps -a` then `docker logs <container_id>`. Fix env vars, cache paths, or batch size as above. |
|
||||
| Wrong cache path or "No such file" when launching | `launch.sh` uses non-existent paths (e.g. `/home/scratch...`) | On DGX Station, edit `launch.sh`: replace cache dirs with `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`, then run `mkdir -p nanochat_cache hf_cache`. |
|
||||
|---------|-------|-----|
|
||||
| `WANDB_API_KEY is not set` or `HF_TOKEN is not set` | Required env vars not exported before `launch.sh` | `export WANDB_API_KEY=<key>` and `export HF_TOKEN=<token>` in the same shell, then re-run `./launch.sh` |
|
||||
| `RuntimeError: CUDA out of memory` | Batch size too large for available VRAM | Edit `speedrun_station.sh`: reduce `--device-batch-size` (try 64, 32, 16, 8). Re-run `./setup.sh` then `./launch.sh` |
|
||||
| Docker container exits immediately | Missing env vars, bad cache paths, or build failure | Check logs: `docker ps -a` then `docker logs <container_id>`. Fix env vars or paths as needed |
|
||||
| `nanochat` image not found | Setup not run or Docker build failed | From the `assets/` directory, run `./setup.sh` and confirm with `docker images \| grep nanochat` |
|
||||
| `No such file or directory` for cache paths | Cache directories don't exist | `launch.sh` creates them automatically under `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`. If using custom paths, create them: `mkdir -p $NANOCHAT_CACHE $HF_CACHE` |
|
||||
| Training hangs at "Waiting for dataset download" | Network issue downloading FineWeb shards | Check network connectivity. The download can take time depending on bandwidth. If it persists, restart `./launch.sh` |
|
||||
| W&B shows wrong user / stale login | Cached W&B credentials in container volume | `speedrun_station.sh` runs `wandb login --relogin` with your key automatically. Ensure `WANDB_API_KEY` is correct |
|
||||
| Container runs but `launch.sh` says "Training complete!" immediately | Container failed fast and exited before the poll loop detected it | Check `docker ps -a` for the exited container and inspect logs with `docker logs <id>` |
|
||||
| GPU not visible inside container | Docker NVIDIA runtime not configured | Test: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi`. If it fails, install/configure [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) |
|
||||
|
||||
@@ -1,11 +1,15 @@
|
||||
FROM nvcr.io/nvidia/pytorch:25.09-py3
|
||||
FROM nvcr.io/nvidia/pytorch:26.04-py3
|
||||
|
||||
WORKDIR /workspace
|
||||
|
||||
# Install dependencies globally so torchrun (which uses /usr/bin/python) can access them
|
||||
RUN /usr/bin/python -m pip install tiktoken tokenizers datasets psutil files-to-prompt regex setuptools uvicorn wandb maturin
|
||||
|
||||
# Create venv with --system-site-packages so it inherits global packages
|
||||
RUN /usr/bin/python -m venv --system-site-packages .venv
|
||||
RUN pip install \
|
||||
datasets \
|
||||
tokenizers \
|
||||
wandb \
|
||||
tiktoken \
|
||||
psutil \
|
||||
files-to-prompt \
|
||||
uvicorn \
|
||||
rustbpe
|
||||
|
||||
CMD ["/bin/bash"]
|
||||
@@ -3,7 +3,6 @@
|
||||
# SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
|
||||
# SPDX-License-Identifier: Apache-2.0
|
||||
#
|
||||
# Lite training (default). Runs speedrun.sh, which setup copies from speedrun_lite.sh.
|
||||
|
||||
# Get wandb API key
|
||||
export WANDB_API_KEY=$WANDB_API_KEY
|
||||
@@ -11,7 +10,6 @@ if [ -z "$WANDB_API_KEY" ]; then
|
||||
echo "WANDB_API_KEY is not set"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
export WANDB_RUN=${WANDB_RUN:-speedrun}
|
||||
|
||||
# Get Hugging Face API key
|
||||
@@ -21,26 +19,23 @@ if [ -z "$HF_TOKEN" ]; then
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Cleanup function to stop containers
|
||||
# Use local cache dirs so no root paths are required
|
||||
workdir=$(pwd)
|
||||
NANOCHAT_CACHE="$(pwd)/nanochat_cache"
|
||||
HF_CACHE="$(pwd)/hf_cache"
|
||||
|
||||
cleanup() {
|
||||
echo
|
||||
echo "Stopping containers..."
|
||||
docker stop $(docker ps -q --filter ancestor=nanochat) 2>/dev/null || true
|
||||
echo "Interrupted training!"
|
||||
echo -e "\nStopping training container..."
|
||||
docker stop $(docker ps -q --filter ancestor=nanochat) 2>/dev/null
|
||||
echo "Cleanup complete."
|
||||
exit 0
|
||||
}
|
||||
|
||||
workdir=$(pwd)
|
||||
# DGX Station: use local cache dirs so no root paths are required
|
||||
NANOCHAT_CACHE="${NANOCHAT_CACHE:-$(pwd)/nanochat_cache}"
|
||||
HF_CACHE="${HF_CACHE:-$(pwd)/hf_cache}"
|
||||
mkdir -p "$NANOCHAT_CACHE" "$HF_CACHE"
|
||||
trap cleanup SIGINT SIGTERM
|
||||
|
||||
cmd="
|
||||
mkdir -p /nanochat_cache && \
|
||||
mkdir -p /hf_cache && \
|
||||
chmod 777 /nanochat_cache && \
|
||||
chmod 777 /hf_cache && \
|
||||
# Launch Nanochat training
|
||||
cmd="mkdir -p $NANOCHAT_CACHE $HF_CACHE && \
|
||||
chmod u+rwx $NANOCHAT_CACHE $HF_CACHE && \
|
||||
docker run \
|
||||
--rm \
|
||||
--runtime=nvidia \
|
||||
@@ -57,16 +52,8 @@ docker run \
|
||||
-v $HF_CACHE:/root/.cache/huggingface \
|
||||
-w /workspace/nanochat \
|
||||
nanochat \
|
||||
bash speedrun.sh"
|
||||
|
||||
bash runs/speedrun.sh"
|
||||
sh -c "$cmd" &
|
||||
|
||||
sleep 5
|
||||
while true; do
|
||||
if ! docker ps | grep -q "nanochat"; then
|
||||
echo
|
||||
echo "Training complete!"
|
||||
exit 0
|
||||
fi
|
||||
sleep 1
|
||||
done
|
||||
wait
|
||||
echo -e "\nTraining complete!"
|
||||
|
||||
@@ -11,10 +11,10 @@ assets_dir="$(cd "$(dirname "$0")" && pwd)"
|
||||
cmd="cd $workdir && \
|
||||
git clone https://github.com/karpathy/nanochat.git && \
|
||||
cd nanochat && \
|
||||
git checkout c6b7ab744055d5915e6ccb61088de80c10cbaff9 && \
|
||||
cp ../speedrun_spark.sh ./speedrun.sh && \
|
||||
git checkout 0aaca56805eb13f6e6e1fff789a08086902f12ab && \
|
||||
cp ../speedrun_station.sh ./runs/speedrun.sh && \
|
||||
cd .. && \
|
||||
chmod +x launch_full.sh 2>/dev/null || true && \
|
||||
chmod +x launch.sh 2>/dev/null || true && \
|
||||
docker build -t nanochat ."
|
||||
|
||||
sh -c "$cmd"
|
||||
|
||||
@@ -1,15 +1,14 @@
|
||||
#!/bin/bash
|
||||
set -e
|
||||
|
||||
# This script is the "Best ChatGPT clone that $100 can buy",
|
||||
# It is designed to run in ~4 hours on 8XH100 node at $3/GPU/hour.
|
||||
# This script is configured to train your own GPT-2 grade LLM (pretraining + finetuning)
|
||||
# It is designed to run on a blank 8XH100 GPU node and takes approximately 3 hours to complete.
|
||||
|
||||
# 1) Example launch (simplest):
|
||||
# bash speedrun.sh
|
||||
# 2) Example launch in a screen session (because the run takes ~4 hours):
|
||||
# screen -L -Logfile speedrun.log -S speedrun bash speedrun.sh
|
||||
# bash runs/speedrun.sh
|
||||
# 2) Example launch in a screen session (because the run takes ~3 hours):
|
||||
# screen -L -Logfile runs/speedrun.log -S speedrun bash runs/speedrun.sh
|
||||
# 3) Example launch with wandb logging, but see below for setting up wandb first:
|
||||
# WANDB_RUN=speedrun screen -L -Logfile speedrun.log -S speedrun bash speedrun.sh
|
||||
# WANDB_RUN=speedrun screen -L -Logfile runs/speedrun.log -S speedrun bash runs/speedrun.sh
|
||||
|
||||
# Default intermediate artifacts directory is in ~/.cache/nanochat
|
||||
export OMP_NUM_THREADS=1
|
||||
@@ -26,7 +25,7 @@ mkdir -p $NANOCHAT_BASE_DIR
|
||||
# install the repo dependencies
|
||||
# uv sync --extra gpu
|
||||
# activate venv so that `python` uses the project's venv instead of system python
|
||||
source ../.venv/bin/activate
|
||||
# source .venv/bin/activate
|
||||
|
||||
# -----------------------------------------------------------------------------
|
||||
# wandb setup
|
||||
@@ -49,70 +48,41 @@ python -m nanochat.report reset
|
||||
# -----------------------------------------------------------------------------
|
||||
# Tokenizer
|
||||
|
||||
# Install Rust / Cargo
|
||||
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
|
||||
source "$HOME/.cargo/env"
|
||||
|
||||
# Build the rustbpe Tokenizer
|
||||
# unset VIRTUAL_ENV
|
||||
maturin develop --release --manifest-path rustbpe/Cargo.toml
|
||||
|
||||
# Download the first ~2B characters of pretraining dataset
|
||||
# look at dev/repackage_data_reference.py for details on how this data was prepared
|
||||
# each data shard is ~250M chars
|
||||
# so we download 2e9 / 250e6 = 8 data shards at this point
|
||||
# each shard is ~100MB of text (compressed), so this is about ~800MB of data on disk
|
||||
# look at dev/repackage_data_reference.py for details on how this data was prepared
|
||||
python -m nanochat.dataset -n 8
|
||||
# Immediately also kick off downloading more shards in the background while tokenizer trains
|
||||
# See comment below for why 240 is the right number here
|
||||
python -m nanochat.dataset -n 240 &
|
||||
# Approximately 150 shards are needed for GPT-2 capability pretraining, add 20 for padding.
|
||||
# The maximum total number of shards available in the entire dataset is 6542.
|
||||
python -m nanochat.dataset -n 170 &
|
||||
DATASET_DOWNLOAD_PID=$!
|
||||
# train the tokenizer with vocab size 2**16 = 65536 on ~2B characters of data
|
||||
python -m scripts.tok_train --max_chars=2000000000
|
||||
# train the tokenizer with vocab size 2**15 = 32768 on ~2B characters of data
|
||||
python -m scripts.tok_train
|
||||
# evaluate the tokenizer (report compression ratio etc.)
|
||||
python -m scripts.tok_eval
|
||||
|
||||
# -----------------------------------------------------------------------------
|
||||
# Base model (pretraining)
|
||||
|
||||
# The d20 model is 561M parameters.
|
||||
# Chinchilla says #tokens = 20X #params, so we need 561e6 * 20 = 11.2B tokens.
|
||||
# Assume our tokenizer is 4.8 chars/token, this is 11.2B * 4.8 ~= 54B chars.
|
||||
# At 250M chars/shard, this is 54B / 250M ~= 216 shards needed for pretraining.
|
||||
# Round up to 240 for safety. At ~100MB/shard, this downloads ~24GB of data to disk.
|
||||
# (The total number of shards available in the entire dataset is 1822.)
|
||||
echo "Waiting for dataset download to complete..."
|
||||
wait $DATASET_DOWNLOAD_PID
|
||||
|
||||
source ../.venv/bin/activate
|
||||
|
||||
# pretrain the d20 model
|
||||
python -m scripts.base_train --depth=20 --run=$WANDB_RUN
|
||||
# evaluate the model on a larger chunk of train/val data and draw some samples
|
||||
python -m scripts.base_loss
|
||||
# evaluate the model on CORE tasks
|
||||
python -m scripts.base_eval
|
||||
|
||||
sleep 5
|
||||
# d24 model (slightly undertrained to beat GPT-2 => decrease data:params ratio from compute optimal 10.5 (default) to 8)
|
||||
python -m scripts.base_train --depth=24 --target-param-data-ratio=8 --device-batch-size=64 --fp8 --run=$WANDB_RUN
|
||||
# evaluate the model: CORE metric, BPB on train/val, and draw samples
|
||||
python -m scripts.base_eval --device-batch-size=64
|
||||
|
||||
# -----------------------------------------------------------------------------
|
||||
# Midtraining (teach the model conversation special tokens, tool use, multiple choice)
|
||||
# SFT (teach the model conversation special tokens, tool use, multiple choice)
|
||||
|
||||
# download 2.3MB of synthetic identity conversations to impart a personality to nanochat
|
||||
# see dev/gen_sft_data.py for details on how this data was prepared and to get a sense of how you can easily tune it
|
||||
# see dev/gen_synthetic_data.py for details on how this data was prepared and to get a sense of how you can easily tune it
|
||||
curl -L -o $NANOCHAT_BASE_DIR/identity_conversations.jsonl https://karpathy-public.s3.us-west-2.amazonaws.com/identity_conversations.jsonl
|
||||
|
||||
# run midtraining and eval the model
|
||||
python -m scripts.mid_train --run=$WANDB_RUN
|
||||
python -m scripts.chat_eval -i mid
|
||||
|
||||
sleep 5
|
||||
|
||||
# -----------------------------------------------------------------------------
|
||||
# Supervised Finetuning (domain adaptation to each sequence all by itself per row)
|
||||
|
||||
# train sft and re-eval right away (should see a small bump)
|
||||
python -m scripts.chat_sft --run=$WANDB_RUN
|
||||
# run SFT and eval the model
|
||||
python -m scripts.chat_sft --device-batch-size=64 --run=$WANDB_RUN
|
||||
python -m scripts.chat_eval -i sft
|
||||
|
||||
# chat with the model over CLI! Leave out the -p to chat interactively
|
||||
@@ -121,15 +91,6 @@ python -m scripts.chat_eval -i sft
|
||||
# even better, chat with your model over a pretty WebUI ChatGPT style
|
||||
# python -m scripts.chat_web
|
||||
|
||||
# -----------------------------------------------------------------------------
|
||||
# Reinforcement Learning. Optional, and currently only on GSM8K
|
||||
# (optional)
|
||||
|
||||
# run reinforcement learning
|
||||
# python -m scripts.chat_rl --run=$WANDB_RUN
|
||||
# eval the RL model only on GSM8K
|
||||
# python -m scripts.chat_eval --i rl -a GSM8K
|
||||
|
||||
# -----------------------------------------------------------------------------
|
||||
# Generate the full report by putting together all the sections
|
||||
# report.md is the output and will be copied to current directory for convenience
|
||||
|
||||
@@ -45,18 +45,16 @@ spec:
|
||||
content: |
|
||||
# Basic idea
|
||||
|
||||
This playbook demonstrates training of [nanochat](https://github.com/karpathy/nanochat) on DGX Station with the GB300 Ultra Superchip. You run the full pipeline on a single system: custom BPE tokenizer training, base model pretraining, midtraining (conversation format), supervised fine-tuning (SFT), and inference via CLI or web UI.
|
||||
This playbook demonstrates training of [nanochat](https://github.com/karpathy/nanochat) on DGX Station with the GB300 Ultra Superchip. You run the full pipeline on a single system: custom BPE tokenizer training, base model pretraining, supervised fine-tuning (SFT), and inference via CLI or web UI.
|
||||
|
||||
The project uses the PyTorch NGC container, FineWeb for pretraining, SmolTalk for SFT, and Weights & Biases for logging. The default speedrun configuration trains a 561M-parameter (d20) model suitable for learning and experimentation.
|
||||
The project uses the PyTorch NGC container, FineWeb for pretraining data, SmolTalk for SFT, and Weights & Biases for logging. The default configuration trains a ~1B parameter (d24) Transformer model with FP8 precision.
|
||||
|
||||
# What you'll accomplish
|
||||
|
||||
You will have a working nanochat setup that trains a small LLM and serves it for chat.
|
||||
|
||||
- **Environment:** Docker image with PyTorch and nanochat dependencies on your DGX Station.
|
||||
- **Training pipeline:** Tokenizer (65K BPE), pretraining (~11.2B tokens), midtraining, SFT, and automated report generation.
|
||||
- **Inference:** ChatGPT-style web UI and CLI to chat with the base, mid, or SFT checkpoints.
|
||||
- **Monitoring:** W&B dashboards and `nanochat/report.md` with metrics and samples.
|
||||
- **Environment:** Docker image with PyTorch NGC and nanochat dependencies on your DGX Station.
|
||||
- **Training pipeline:** BPE tokenizer (65K vocab), base model pretraining with FP8, SFT, and automated report generation.
|
||||
- **Inference:** ChatGPT-style web UI and CLI to chat with the base or SFT checkpoints.
|
||||
- **Monitoring:** W&B dashboards and `nanochat_cache/report/report.md` with metrics and samples.
|
||||
|
||||
# What to know before starting
|
||||
|
||||
@ -68,36 +66,58 @@ spec:
|
||||
|
||||
**Hardware:**
|
||||
|
||||
- NVIDIA DGX Station with GB300 Ultra Superchip.
|
||||
- Sufficient GPU memory for the chosen model (the GB300 Ultra provides ample memory for the d20 speedrun).
|
||||
- Adequate storage for cache (~24GB+ for FineWeb data and checkpoints).
|
||||
- NVIDIA DGX Station with GB300 Ultra Superchip (288GB VRAM).
|
||||
- Adequate storage for cache (~25GB+ for FineWeb data and checkpoints).
|
||||
|
||||
**Software:**
|
||||
|
||||
- Docker with NVIDIA Container Toolkit: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.01-py3 nvidia-smi`
|
||||
- Network access to download datasets (Hugging Face, FineWeb) and container images.
|
||||
- Docker with NVIDIA Container Toolkit: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi`
|
||||
- Network access to download datasets (Hugging Face, FineWeb) and container images (nvcr.io).
|
||||
- [Weights & Biases](https://wandb.ai/) account and API key.
|
||||
- [Hugging Face](https://huggingface.co/docs/hub/en/security-tokens) token for evaluation datasets.
|
||||
|
||||
# Model architecture (d24)
|
||||
|
||||
```
|
||||
Layers: 24
|
||||
Attention Heads: 12
|
||||
Head Dimension: 128
|
||||
Context Length: 2048 tokens
|
||||
Vocabulary Size: 65,536 (2^16, trained BPE)
|
||||
Precision: FP8 (e4m3, tensorwise scaling)
|
||||
```
|
||||
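As a rough sanity check on the "~1B parameter" figure, the sizes listed above can be plugged into a back-of-the-envelope estimate. This is only a sketch under stated assumptions, none of which come from the playbook: the model width is taken as heads × head dimension, each layer is assumed to hold about 12 × width² weights (attention plus a 4x MLP), and the input and output embeddings are assumed to be untied. The helper name `estimate_params` is hypothetical.

```bash
# Rough parameter estimate for the d24 configuration (illustrative only).
# Assumptions, not stated in the playbook: width = heads * head_dim,
# ~12 * width^2 weights per layer, and untied input/output embeddings.
estimate_params() {
  local layers=$1 heads=$2 head_dim=$3 vocab=$4
  local width=$(( heads * head_dim ))          # 12 * 128 = 1536
  local per_layer=$(( 12 * width * width ))    # attention + MLP weights
  local embeddings=$(( 2 * vocab * width ))    # token embedding + output head
  echo $(( layers * per_layer + embeddings ))
}

estimate_params 24 12 128 65536   # prints 880803840
```

Under these assumptions the estimate comes to about 0.88B parameters, which is in line with the "~1B" description; the actual count reported by nanochat may differ.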
|
||||
# Training stages
|
||||
|
||||
| Stage | Description |
|
||||
|-------|-------------|
|
||||
| Tokenizer | Trains BPE tokenizer (65K vocab) on ~2B characters from FineWeb |
|
||||
| Base pretraining | Pretrains d24 model on FineWeb with FP8, target data:param ratio of 8 |
|
||||
| SFT | Fine-tunes on synthetic identity conversations + SmolTalk |
|
||||
| Report | Generates `report.md` with metrics, samples, and system info |
|
||||
|
||||
# Ancillary files
|
||||
|
||||
All required assets are in the playbook directory `nvidia/station-nanochat/assets` (see the [dgx-spark-playbooks](https://github.com/NVIDIA/dgx-spark-playbooks) repository).
|
||||
|
||||
- `assets/Dockerfile` – PyTorch NGC image plus nanochat dependencies and venv.
|
||||
- `assets/setup.sh` – Clones nanochat, checks out the supported commit, and builds the Docker image.
|
||||
- `assets/launch.sh` – Runs the training container on your DGX Station (runs the full pipeline: tokenizer, pretrain, midtrain, SFT, and report generation).
|
||||
- `assets/README.md` – Additional detail on training stages, inference, and troubleshooting.
|
||||
All required assets are in `nvidia/station-nanochat/assets/`:
|
||||
|
||||
- `Dockerfile` – PyTorch NGC image with nanochat pip dependencies.
|
||||
- `setup.sh` – Clones nanochat, checks out the supported commit, copies `speedrun_station.sh`, and builds the Docker image.
|
||||
- `launch.sh` – Runs the training container (full pipeline: tokenizer → pretrain → SFT → report).
|
||||
- `speedrun_station.sh` – Modified speedrun script adapted for single-GPU DGX Station.
|
||||
|
||||
# Time & risk
|
||||
|
||||
- **Estimated time:** About 30 minutes for clone, setup, and launching the run. Full d20 speedrun training time depends on your DGX Station configuration (hours to a day or more).
|
||||
- **Risk level:** Medium
|
||||
- Large downloads (FineWeb) can fail or be slow; ensure stable network and disk space.
|
||||
- API keys (W&B, HF) must be set or the launch script will exit.
|
||||
- **Rollback:** Stop containers with `docker stop`, remove caches under `~/.cache/nanochat` (or paths in `launch.sh`), and run `docker system prune -a` if needed.
|
||||
* **Last Updated:** 03/02/2026
|
||||
* First Publication
|
||||
- **Estimated time:** ~30 minutes for setup. Full d24 training takes on the order of 16+ hours on a single GB300 Ultra.
|
||||
- **Risk level:** Medium
|
||||
- Large downloads (FineWeb) can be slow; ensure stable network and disk space.
|
||||
- API keys (W&B, HF) must be set or `launch.sh` will exit immediately.
|
||||
- **Rollback:** Stop containers with `docker stop`, remove caches, and run `docker system prune -a` if needed.
|
||||
|
||||
# Credits
|
||||
|
||||
- [nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy
|
||||
- [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) by HuggingFace (pretraining data)
|
||||
- [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) by HuggingFace (SFT data)
|
||||
|
||||
|
||||
|
||||
@ -108,69 +128,86 @@ spec:
|
||||
content: |
|
||||
# Step 1. Prerequisites and environment
|
||||
|
||||
This playbook is for **DGX Station** (single node). Ensure your DGX Station has Docker with NVIDIA runtime, GPU access, and required API keys. Nanochat uses Weights & Biases (W&B) for training visualization and a Hugging Face token for evaluation datasets.
|
||||
Ensure your DGX Station has Docker with NVIDIA runtime and GPU access. Nanochat uses Weights & Biases (W&B) for training visualization and a Hugging Face token for evaluation datasets.
|
||||
|
||||
```bash
|
||||
# Verify GPU and Docker
|
||||
nvidia-smi
|
||||
docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.01-py3 nvidia-smi
|
||||
docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi
|
||||
```
|
||||
|
||||
Expected output should show your GPU(s) and driver version. Create a [W&B account](https://wandb.ai/) and a [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens) if you do not have them.
|
||||
Create a [W&B account](https://wandb.ai/) and a [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens) if you don't have them. Export both keys in your shell:
|
||||
|
||||
```bash
|
||||
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
|
||||
export HF_TOKEN=<YOUR_HF_TOKEN>
|
||||
```
|
||||
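Because a missing key stops the launch, it can be convenient to confirm both variables in the current shell first. The following is a minimal sketch of such a pre-flight check; the function name `check_keys` is hypothetical and is not part of the playbook's scripts, which perform their own validation.

```bash
# Pre-flight check (illustrative; not part of the playbook's scripts).
# Reports each required variable that is empty or unset.
check_keys() {
  local var missing=0
  for var in WANDB_API_KEY HF_TOKEN; do
    # ${!var:-} is bash indirect expansion: the value of the variable named by $var
    if [ -z "${!var:-}" ]; then
      echo "$var is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}

check_keys && echo "API keys are set"
```

The function returns non-zero if either variable is missing, so it can also gate a launch, for example `check_keys && ./launch.sh`.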
|
||||
# Step 2. Clone the playbook and set up nanochat
|
||||
# Step 2. Clone and set up
|
||||
|
||||
Clone the playbook repository and run the setup script to clone the nanochat repo and build the Docker image.
|
||||
Clone the playbook repository and navigate to the assets directory:
|
||||
|
||||
```bash
|
||||
git clone https://github.com/NVIDIA/dgx-spark-playbooks
|
||||
cd dgx-spark-playbooks/nvidia/station-nanochat/assets
|
||||
```
|
||||
|
||||
From the `assets` directory, run the setup script. It clones [nanochat](https://github.com/karpathy/nanochat), checks out the supported commit, and builds the `nanochat` Docker image (PyTorch NGC base with tiktoken, tokenizers, datasets, wandb, etc.).
|
||||
Run the setup script. It clones [nanochat](https://github.com/karpathy/nanochat), checks out the supported commit, copies the station-adapted `speedrun_station.sh`, and builds the `nanochat` Docker image (PyTorch NGC base with dependencies):
|
||||
|
||||
```bash
|
||||
./setup.sh
|
||||
```
|
||||
|
||||
Setup may take several minutes while the image builds. Verify the image:
|
||||
You should see the `nanochat` image listed if you run `docker images`. Your directory structure after setup should look like this:
|
||||
|
||||
```bash
|
||||
docker images | grep nanochat
|
||||
```
|
||||
assets/
|
||||
├── Dockerfile
|
||||
├── launch.sh
|
||||
├── setup.sh
|
||||
├── speedrun_station.sh
|
||||
└── nanochat/
|
||||
```
|
||||
|
||||
You should see the `nanochat` image listed.
|
||||
# Step 3. Launch training
|
||||
|
||||
# Step 3. Launch full training
|
||||
|
||||
> [!NOTE]
|
||||
> The default `launch.sh` uses cache directories under `/nanochat_cache`. If that path does not exist on your DGX Station, edit `launch.sh` and replace those paths with your own (e.g. `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`), and create the directories before running.
|
||||
|
||||
To run **full** training (d20 model, 240 shards, midtraining, SFT, report) for higher-quality results, use the full launcher. On a DGX Station with GB300 Ultra this can take on the order of 16 hours:
|
||||
Ensure your API keys are exported, then launch:
|
||||
|
||||
```bash
|
||||
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
|
||||
export HF_TOKEN=<YOUR_HF_TOKEN>
|
||||
./launch_full.sh
|
||||
./launch.sh
|
||||
```
|
||||
|
||||
This runs `speedrun_full.sh` inside the container: full FineWeb download (240 shards), 561M-parameter (d20) pretraining, midtraining, supervised fine-tuning, and report generation.
|
||||
The training runs inside the `nanochat` container and executes the full pipeline automatically:
|
||||
|
||||
# Step 4. Verify and use the model
|
||||
1. **Tokenizer training** — downloads ~2B characters from FineWeb, trains a 65K BPE tokenizer
|
||||
2. **Base model pretraining** — downloads additional FineWeb shards, pretrains a d24 model (~1B params) with FP8
|
||||
3. **SFT** — downloads synthetic identity conversations, fine-tunes for chat
|
||||
4. **Report generation** — produces `report.md` with metrics and samples
|
||||
|
||||
After training completes, checkpoints and the tokenizer are under `~/.cache/nanochat/` (or the cache path used in `launch.sh`). Run inference from the nanochat directory (e.g. `assets/nanochat`) on your DGX Station.
|
||||
Training on a single GB300 Ultra takes on the order of 16+ hours for the full d24 run.
|
||||
|
||||
# Step 4. Monitor training
|
||||
|
||||
**W&B dashboard:**
|
||||
|
||||
Track training at [wandb.ai](https://wandb.ai/) under the `nanochat` project. The exact link to the W&B run appears in the training logs. Key metrics:
|
||||
- Training loss
|
||||
- Validation BPB
|
||||
- Throughput (tokens/sec)
|
||||
|
||||
# Step 5. Inference
|
||||
|
||||
After training, checkpoints are saved under the `nanochat_cache/` directory. Run inference from inside the container or interactively:
|
||||
|
||||
**Web UI (recommended):**
|
||||
|
||||
```bash
|
||||
cd nanochat
|
||||
source ../.venv/bin/activate # if using venv from container context; otherwise use the container
|
||||
python -m scripts.chat_web
|
||||
docker run --rm --gpus all --net=host \
|
||||
-v $(pwd)/nanochat:/workspace/nanochat \
|
||||
-v $(pwd)/nanochat_cache:/root/.cache/nanochat \
|
||||
-w /workspace/nanochat \
|
||||
nanochat \
|
||||
python -m scripts.chat_web
|
||||
```
|
||||
|
||||
Open a browser to `http://<STATION_IP>:8000` where `<STATION_IP>` is your DGX Station’s IP address.
|
||||
@ -178,14 +215,15 @@ spec:
|
||||
**CLI:**
|
||||
|
||||
```bash
|
||||
cd nanochat
|
||||
python -m scripts.chat_cli -p "Why is the sky blue?"
|
||||
python -m scripts.chat_cli -i sft -p "Write a haiku about machine learning"
|
||||
docker run --rm -it --gpus all \
|
||||
-v $(pwd)/nanochat:/workspace/nanochat \
|
||||
-v $(pwd)/nanochat_cache:/root/.cache/nanochat \
|
||||
-w /workspace/nanochat \
|
||||
nanochat \
|
||||
python -m scripts.chat_cli -p "Why is the sky blue?"
|
||||
```
|
||||
|
||||
A full report is generated at `nanochat/report.md` after the run. You can also monitor training at [wandb.ai](https://wandb.ai/) under your project.
|
||||
|
||||
# Step 5. Cleanup
|
||||
# Step 6. Cleanup
|
||||
|
||||
To stop training early, interrupt the launch script or stop the container:
|
||||
|
||||
@ -195,23 +233,32 @@ spec:
|
||||
```bash
|
||||
# If launch.sh is running: press Ctrl+C
|
||||
|
||||
# Or stop the container by name
|
||||
# Or stop the container directly
|
||||
docker stop $(docker ps -q --filter ancestor=nanochat)
|
||||
```
|
||||
|
||||
To free disk space after training (use the same path as your cache if you set `NANOCHAT_CACHE`):
|
||||
To free disk space:
|
||||
|
||||
```bash
|
||||
rm -rf ./nanochat_cache ./hf_cache
|
||||
docker system prune -a
|
||||
```
|
||||
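Before deleting anything, it may help to see how much space the caches actually occupy. This is a small sketch that assumes the default cache locations named above; the helper `report_cache_usage` is hypothetical, so adjust the paths if you use custom cache directories.

```bash
# Report disk usage of cache directories before removing them (illustrative).
report_cache_usage() {
  local dir
  for dir in "$@"; do
    if [ -d "$dir" ]; then
      du -sh "$dir"            # human-readable total for the directory
    else
      echo "$dir: not found"
    fi
  done
}

report_cache_usage ./nanochat_cache ./hf_cache
```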
|
||||
# Step 6. Next steps and customization
|
||||
# Step 7. Customization
|
||||
|
||||
- **Small scale run:** To run a lighter training job with `./launch.sh`, edit `speedrun_station.sh` as described in the customization guide. This can reduce the training time.
|
||||
- **Custom cache paths:** Set `NANOCHAT_CACHE` and `HF_CACHE` before launching (e.g. `export NANOCHAT_CACHE=/path/to/nanochat_cache`) if you want cache outside the assets directory.
|
||||
- **Monitoring:** Use `nvidia-smi` and W&B dashboards to watch GPU utilization and training metrics (loss, throughput).
|
||||
- **Inference:** Try the web UI and CLI with different checkpoints (`base`, `mid`, `sft`) and prompts; see sample prompts in `assets/README.md`.
|
||||
**Smaller/faster run:** Edit `speedrun_station.sh` before running setup to reduce data and model size:
|
||||
|
||||
```bash
|
||||
# Fewer data shards (10 instead of default)
|
||||
python -m nanochat.dataset -n 10 &
|
||||
|
||||
# Smaller model (d4 instead of d24), smaller batch size
|
||||
python -m scripts.base_train --depth=4 --device-batch-size=32
|
||||
```
|
||||
|
||||
**Batch size:** The default `--device-batch-size=64` is tuned for the GB300's 288GB VRAM. Increase it if GPU utilization is low, or reduce it if training runs out of memory.
|
||||
|
||||
Then re-run `./setup.sh` to rebuild with the changes.
|
||||
|
||||
|
||||
|
||||
@ -221,14 +268,16 @@ spec:
|
||||
label: Troubleshooting
|
||||
content: |
|
||||
| Symptom | Cause | Fix |
|
||||
|--------|--------|-----|
|
||||
| `WANDB_API_KEY is not set` or `HF_TOKEN is not set` | Required env vars not exported before `launch.sh` | `export WANDB_API_KEY=<your_key>` and `export HF_TOKEN=<your_token>` in the same shell, then run `./launch.sh`. |
|
||||
| `RuntimeError: CUDA out of memory` | Batch size or model too large for GPU | In the training script in the cloned nanochat repo (e.g. `speedrun.sh`), reduce `--device_batch_size` (e.g. `16`, `8`, `4`, `2`, or `1`). |
|
||||
| Docker container not starting or no GPU | Docker or NVIDIA runtime misconfigured | Run `nvidia-smi` on your DGX Station. Check no other containers hold GPUs: `docker ps`. Test GPU in Docker: `docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi`. |
|
||||
| `Permission denied` or `No such file or directory` for cache paths in `launch.sh` | Paths like `/home/scratch.lramesh_dpt/...` don’t exist on your system | Edit `launch.sh`: set cache dirs to paths you can create (e.g. `$(pwd)/nanochat_cache`, `$(pwd)/hf_cache`). Run `mkdir -p <your_cache_dirs>` and re-run `launch.sh`. |
|
||||
| `nanochat` image not found when running `launch.sh` | Setup not run or build failed | From `nvidia/nanochat/assets`, run `./setup.sh` and confirm with `docker images` (look for the `nanochat` image). |
|
||||
| Training exits immediately or script doesn’t wait | Container fails early (missing keys, paths, or OOM) | Check container logs: `docker ps -a` then `docker logs <container_id>`. Fix env vars, cache paths, or batch size as above. |
|
||||
| Wrong cache path or "No such file" when launching | `launch.sh` uses non-existent paths (e.g. `/home/scratch...`) | On DGX Station, edit `launch.sh`: replace cache dirs with `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`, then run `mkdir -p nanochat_cache hf_cache`. |
|
||||
|---------|-------|-----|
|
||||
| `WANDB_API_KEY is not set` or `HF_TOKEN is not set` | Required env vars not exported before `launch.sh` | `export WANDB_API_KEY=<key>` and `export HF_TOKEN=<token>` in the same shell, then re-run `./launch.sh` |
|
||||
| `RuntimeError: CUDA out of memory` | Batch size too large for available VRAM | Edit `speedrun_station.sh`: reduce `--device-batch-size` below the default of 64 (try 32, 16, 8). Re-run `./setup.sh` then `./launch.sh` |
|
||||
| Docker container exits immediately | Missing env vars, bad cache paths, or build failure | Check logs: `docker ps -a` then `docker logs <container_id>`. Fix env vars or paths as needed |
|
||||
| `nanochat` image not found | Setup not run or Docker build failed | From the `assets/` directory, run `./setup.sh` and confirm with `docker images \| grep nanochat` |
|
||||
| `No such file or directory` for cache paths | Cache directories don't exist | `launch.sh` creates them automatically under `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`. If using custom paths, create them: `mkdir -p $NANOCHAT_CACHE $HF_CACHE` |
|
||||
| Training hangs at "Waiting for dataset download" | Network issue downloading FineWeb shards | Check network connectivity. The download can take time depending on bandwidth. If it persists, restart `./launch.sh` |
|
||||
| W&B shows wrong user / stale login | Cached W&B credentials in container volume | `speedrun_station.sh` runs `wandb login --relogin` with your key automatically. Ensure `WANDB_API_KEY` is correct |
|
||||
| Container runs but `launch.sh` says "Training complete!" immediately | Container failed fast and exited before the poll loop detected it | Check `docker ps -a` for the exited container and inspect logs with `docker logs <id>` |
|
||||
| GPU not visible inside container | Docker NVIDIA runtime not configured | Test: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi`. If it fails, install/configure [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) |
|
||||
|
||||
|
||||
|
||||
|
||||
@ -87,7 +87,7 @@ spec:
|
||||
- NVIDIA DGX Station with Blackwell architecture GPU (GB300 chip)
|
||||
- Docker installed with GPU support
|
||||
- NVIDIA Container Toolkit configured
|
||||
- Megatron-Bridge installed (via the the NeMo Framework NGC container)
|
||||
- Megatron-Bridge installed (via the NeMo Framework NGC container)
|
||||
|
||||
Verify your setup:
|
||||
|
||||
@ -139,7 +139,7 @@ spec:
|
||||
nvcr.io/nvidia/nemo:${TAG}
|
||||
```
|
||||
|
||||
All subsequent `torchrun` / `python` commands in this playbook are meant to be executed **from the shell inside this container** .
|
||||
All subsequent `torchrun` / `python` commands in this playbook are meant to be executed **from the shell inside this container**.
|
||||
|
||||
# Step 2. Review the pretraining script
|
||||
|
||||
@ -279,7 +279,7 @@ spec:
|
||||
| `CUDA out of memory` during model init | Insufficient GPU memory for Llama 3.1 8B + optimizer states | Reduce `micro_batch_size` or use `--nproc_per_node` for model parallelism |
|
||||
| `torchrun` hangs or times out | NCCL communication failure between GPUs | Check `NCCL_DEBUG=INFO torchrun ...` for details; verify all GPUs are visible |
|
||||
| Training loss is NaN | Precision instability | Increase `num_layers_at_end_in_bf16` (e.g., from 4 to 8) or reduce learning rate |
|
||||
| `--no-fp4` works but NVFP4 crashes | Transformer Engine version mismatch | Ensure Transformer Engine supports NVFP4; update with `pip install --upgrade transformer-engine` |
|
||||
| `--disable-fp4` works but NVFP4 crashes | Transformer Engine version mismatch | Ensure Transformer Engine supports NVFP4; update with `pip install --upgrade transformer-engine` |
|
||||
| Slow training throughput | Not using Tensor Cores efficiently | Ensure batch dimensions are multiples of 8; check that `nvidia-smi` shows high GPU utilization |
|
||||
| Permission denied on Docker | User not in docker group | Run `sudo usermod -aG docker $USER && newgrp docker` |
|
||||
|
||||
|
||||
@ -330,7 +330,7 @@ spec:
|
||||
|
||||
```bash
|
||||
git clone https://github.com/NVIDIA/dgx-spark-playbooks
|
||||
cd dgx-station-playbooks/nvidia/station-sglang-inference
|
||||
cd dgx-spark-playbooks/nvidia/station-sglang-inference
|
||||
```
|
||||
|
||||
> [!TIP]
|
||||
|
||||
@ -1,8 +1,8 @@
|
||||
kind: Playbook
|
||||
metadata:
|
||||
name: station-vllm
|
||||
displayName: Serve Qwen3-235B with vLLM
|
||||
shortDescription: Set up vLLM server with Qwen3-235B on DGX Station
|
||||
displayName: vLLM for Inference
|
||||
shortDescription: Install and use vLLM on DGX Station
|
||||
publisher: nvidia
|
||||
description: |
|
||||
# REPLACE THIS WITH YOUR MODEL CARD
|
||||
@ -15,7 +15,7 @@ metadata:
|
||||
|
||||
attributes:
|
||||
- key: DURATION
|
||||
value: 20 MIN
|
||||
value: 30 MIN
|
||||
|
||||
spec:
|
||||
artifactName: station-vllm
|
||||
@ -42,7 +42,9 @@ spec:
|
||||
|
||||
# What you'll accomplish
|
||||
|
||||
Serve the **Qwen3-235B-A22B-NVFP4** model using vLLM on NVIDIA DGX Station. This 235B parameter model uses NVFP4 quantization and fits entirely in VRAM on the GB300 GPU.
|
||||
Serve a **supported model** using vLLM on NVIDIA DGX Station. Refer to the table below to see the supported models.
|
||||
|
||||
You'll set up vLLM high-throughput LLM serving on NVIDIA DGX Station with Blackwell architecture.
|
||||
|
||||
# What to know before starting
|
||||
|
||||
@ -57,21 +59,30 @@ spec:
|
||||
- HuggingFace account with access token
|
||||
- Network access to NGC and HuggingFace
|
||||
|
||||
# Model Support Matrix
|
||||
|
||||
The following models are supported with vLLM on DGX Station. All listed models are available and ready to use:
|
||||
|
||||
| Model | Quantization | Support Status | HF Handle |
|
||||
|-------|-------------|----------------|-----------|
|
||||
| **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) |
|
||||
| **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) |
|
||||
| **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) |
|
||||
|
||||
# Time & risk
|
||||
|
||||
* **Duration:** 15-20 minutes (longer on first run due to model download)
|
||||
* **Duration:** 30 minutes (longer on first run due to model download)
|
||||
* **Risks:** Model download requires HuggingFace authentication
|
||||
* **Rollback:** Stop and remove the container to restore state
|
||||
* **Last Updated:** 03/02/2026
|
||||
* First Publication
|
||||
* **Last Updated:** 05/28/2026
|
||||
* Update models
|
||||
|
||||
|
||||
|
||||
-
|
||||
id: instructions
|
||||
|
||||
label: Serve Qwen3-235B
|
||||
label: Instructions
|
||||
content: |
|
||||
# Step 1. Set up Docker permissions
|
||||
|
||||
@ -92,7 +103,7 @@ spec:
|
||||
export HF_TOKEN="your_huggingface_token"
|
||||
|
||||
# Model to serve
|
||||
export MODEL_HANDLE="nvidia/Qwen3-235B-A22B-NVFP4"
|
||||
export MODEL_HANDLE="<HF_HANDLE>"
|
||||
|
||||
# Maximum context length
|
||||
export MAX_MODEL_LEN=8192
|
||||
@ -106,9 +117,16 @@ spec:
|
||||
docker pull nvcr.io/nvidia/vllm:26.01-py3
|
||||
```
|
||||
|
||||
For Step-3.7-Flash models, pull the custom vLLM container:
|
||||
```bash
|
||||
docker pull vllm/vllm-openai:stepfun37
|
||||
```
|
||||
|
||||
# Step 4. Start vLLM server
|
||||
|
||||
Start the vLLM server with the Qwen3-235B model. This model fits entirely in VRAM on the GB300. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`.
|
||||
Start the vLLM server with the model. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`.
|
||||
|
||||
For the Qwen3-235B NVFP4 model, run with the NGC container. This model fits entirely in VRAM on the GB300.
|
||||
|
||||
```bash
|
||||
docker run -d \
|
||||
@ -126,6 +144,28 @@ spec:
|
||||
--gpu-memory-utilization 0.9
|
||||
```
|
||||
|
||||
For Step-3.7-Flash models, run with the custom vLLM container. Both the FP8 and NVFP4 versions fit entirely in VRAM on the GB300.
|
||||
|
||||
```bash
|
||||
docker run -d \
|
||||
--name vllm-server \
|
||||
--gpus all \
|
||||
--ipc host \
|
||||
--ulimit memlock=-1 \
|
||||
--ulimit stack=67108864 \
|
||||
-p 8000:8000 \
|
||||
-e HF_TOKEN="$HF_TOKEN" \
|
||||
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
|
||||
vllm/vllm-openai:stepfun37 \
|
||||
"$MODEL_HANDLE" \
|
||||
--gpu-memory-utilization 0.95 \
|
||||
--trust-remote-code \
|
||||
--reasoning-parser step3p5 \
|
||||
--enable-auto-tool-choice \
|
||||
--tool-call-parser step3p5 \
|
||||
--kv-cache-dtype fp8
|
||||
```
|
||||
|
||||
Check the server logs for startup progress:
|
||||
|
||||
```bash
|
||||
@ -135,7 +175,7 @@ spec:
|
||||
Expected output includes:
|
||||
- Model download progress (first run only)
|
||||
- Model loading into GPU memory
|
||||
- `Uvicorn running on http://0.0.0.0:8000`
|
||||
- `Application startup complete.`
|
||||
|
||||
Press `Ctrl+C` to exit log view once the server is ready.
|
||||
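Instead of watching the logs, readiness can also be polled from the host. The sketch below assumes the server is reachable on `localhost:8000`, as in the port mapping used in this playbook, and relies on the `/health` endpoint of vLLM's OpenAI-compatible server. The function name `wait_for_server` and its retry defaults are hypothetical choices for illustration.

```bash
# Poll the server until it answers, or give up (illustrative).
# Arguments: URL (default http://localhost:8000/health), attempts, delay in seconds.
wait_for_server() {
  local url=${1:-http://localhost:8000/health} tries=${2:-60} delay=${3:-10}
  local i
  for (( i = 0; i < tries; i++ )); do
    # -s silences progress output, -f makes curl fail on HTTP errors
    if curl -sf "$url" > /dev/null; then
      echo "server is ready"
      return 0
    fi
    sleep "$delay"
  done
  echo "server did not become ready" >&2
  return 1
}
```

Calling `wait_for_server` with no arguments checks every 10 seconds for up to 10 minutes; large models may need a higher attempt count on first run while weights download.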
|
||||
@ -166,9 +206,10 @@ spec:
|
||||
|
||||
Optionally, remove the image and cached model:
|
||||
|
||||
For example:
|
||||
```bash
|
||||
docker rmi nvcr.io/nvidia/vllm:26.01-py3
|
||||
rm -rf $HOME/.cache/huggingface/hub/models--nvidia--Qwen3-235B-A22B-NVFP4
|
||||
docker rmi "<docker image name>"
|
||||
rm -rf $HOME/.cache/huggingface/hub/"<downloaded model name>"
|
||||
```
|
||||
|
||||
|
||||
|
||||
@ -1,8 +1,8 @@
|
||||
kind: Playbook
|
||||
metadata:
|
||||
name: station-vllm
|
||||
displayName: vLLM for Inference
|
||||
shortDescription: Install and use vLLM on DGX Station
|
||||
displayName: Serve Qwen3-235B with vLLM
|
||||
shortDescription: Set up vLLM server with Qwen3-235B on DGX Station
|
||||
publisher: nvidia
|
||||
description: |
|
||||
# REPLACE THIS WITH YOUR MODEL CARD
|
||||
@ -15,7 +15,7 @@ metadata:
|
||||
|
||||
attributes:
|
||||
- key: DURATION
|
||||
value: 30 MIN
|
||||
value: 20 MIN
|
||||
|
||||
spec:
|
||||
artifactName: station-vllm
|
||||
@ -42,9 +42,7 @@ spec:
|
||||
|
||||
# What you'll accomplish
|
||||
|
||||
Serve a **supported model** using vLLM on NVIDIA DGX Station. Refer to the table below to see the supported models.
|
||||
|
||||
You'll set up vLLM high-throughput LLM serving on NVIDIA DGX Station with Blackwell architecture.
|
||||
Serve the **Qwen3-235B-A22B-NVFP4** model using vLLM on NVIDIA DGX Station. This 235B parameter model uses NVFP4 quantization and fits entirely in VRAM on the GB300 GPU.
|
||||
|
||||
# What to know before starting
|
||||
|
||||
@ -59,30 +57,21 @@ spec:
|
||||
- HuggingFace account with access token
|
||||
- Network access to NGC and HuggingFace
|
||||
|
||||
# Model Support Matrix
|
||||
|
||||
The following models are supported with vLLM on Spark. All listed models are available and ready to use:
|
||||
|
||||
| Model | Quantization | Support Status | HF Handle |
|
||||
|-------|-------------|----------------|-----------|
|
||||
| **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) |
|
||||
| **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) |
|
||||
| **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) |
|
||||
|
||||
# Time & risk
|
||||
|
||||
* **Duration:** 30 minutes (longer on first run due to model download)
|
||||
* **Duration:** 15-20 minutes (longer on first run due to model download)
|
||||
* **Risks:** Model download requires HuggingFace authentication
|
||||
* **Rollback:** Stop and remove the container to restore state
|
||||
* **Last Updated:** 05/28/2026
|
||||
* Update models
|
||||
* **Last Updated:** 03/02/2026
|
||||
* First Publication
|
||||
|
||||
|
||||
|
||||
-
|
||||
id: instructions
|
||||
|
||||
label: Instructions
|
||||
label: Serve Qwen3-235B
|
||||
content: |
|
||||
# Step 1. Set up Docker permissions
|
||||
|
||||
@ -103,7 +92,7 @@ spec:
|
||||
export HF_TOKEN="your_huggingface_token"
|
||||
|
||||
# Model to serve
|
||||
export MODEL_HANDLE="<HF_HANDLE>"
|
||||
export MODEL_HANDLE="nvidia/Qwen3-235B-A22B-NVFP4"
|
||||
|
||||
# Maximum context length
|
||||
export MAX_MODEL_LEN=8192
|
||||
@ -117,16 +106,9 @@ spec:
|
||||
docker pull nvcr.io/nvidia/vllm:26.01-py3
|
||||
```
|
||||
|
||||
For Step-3.7-Flash models, pull the custom vLLM container:
|
||||
```bash
|
||||
docker pull vllm/vllm-openai:stepfun37
|
||||
```
|
||||
|
||||
# Step 4. Start vLLM server
|
||||
|
||||
Start the vLLM server with the model. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`.
|
||||
|
||||
For the Qwen3-235B NVFP4 model, run with the NGC container. This model fits entirely in VRAM on the GB300.
|
||||
Start the vLLM server with the Qwen3-235B model. This model fits entirely in VRAM on the GB300. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`.
|
||||
|
||||
```bash
|
||||
docker run -d \
|
||||
@ -144,28 +126,6 @@ spec:
|
||||
--gpu-memory-utilization 0.9
|
||||
```
|
||||
|
||||
For Step-3.7-Flash models, run with the custom vLLM container. Both the FP8 and NVFP4 versions fit entirely in VRAM on the GB300.
|
||||
|
||||
```bash
|
||||
docker run -d \
|
||||
--name vllm-server \
|
||||
--gpus all \
|
||||
--ipc host \
|
||||
--ulimit memlock=-1 \
|
||||
--ulimit stack=67108864 \
|
||||
-p 8000:8000 \
|
||||
-e HF_TOKEN="$HF_TOKEN" \
|
||||
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
|
||||
vllm/vllm-openai:stepfun37 \
|
||||
"$MODEL_HANDLE" \
|
||||
--gpu-memory-utilization 0.95 \
|
||||
--trust-remote-code \
|
||||
--reasoning-parser step3p5 \
|
||||
--enable-auto-tool-choice \
|
||||
--tool-call-parser step3p5 \
|
||||
--kv-cache-dtype fp8
|
||||
```
|
||||
|
||||
Check the server logs for startup progress:
|
||||
|
||||
```bash
|
||||
@ -175,7 +135,7 @@ spec:
|
||||
Expected output includes:
|
||||
- Model download progress (first run only)
|
||||
- Model loading into GPU memory
|
||||
- `Application startup complete.`
|
||||
- `Uvicorn running on http://0.0.0.0:8000`
|
||||
|
||||
Press `Ctrl+C` to exit log view once the server is ready.
|
||||
|
||||
@ -206,10 +166,9 @@ spec:
|
||||
|
||||
Optionally, remove the image and cached model:
|
||||
|
||||
For example:
|
||||
```bash
|
||||
docker rmi "<docker image name>"
|
||||
rm -rf $HOME/.cache/huggingface/hub/"<downloaded model name>"
|
||||
docker rmi nvcr.io/nvidia/vllm:26.01-py3
|
||||
rm -rf $HOME/.cache/huggingface/hub/models--nvidia--Qwen3-235B-A22B-NVFP4
|
||||
```
|
||||
|
||||
|
||||
|
||||