chore: Regenerate all playbooks

GitLab CI 2026-05-29 15:56:45 +00:00
parent a9383bb067
commit 6942395d72
10 changed files with 362 additions and 312 deletions


@ -15,18 +15,16 @@
## Basic idea ## Basic idea
This playbook demonstrates training of [nanochat](https://github.com/karpathy/nanochat) on DGX Station with the GB300 Ultra Superchip. You run the full pipeline on a single system: custom BPE tokenizer training, base model pretraining, midtraining (conversation format), supervised fine-tuning (SFT), and inference via CLI or web UI. This playbook demonstrates training of [nanochat](https://github.com/karpathy/nanochat) on DGX Station with the GB300 Ultra Superchip. You run the full pipeline on a single system: custom BPE tokenizer training, base model pretraining, supervised fine-tuning (SFT), and inference via CLI or web UI.
The project uses the PyTorch NGC container, FineWeb for pretraining, SmolTalk for SFT, and Weights & Biases for logging. The default speedrun configuration trains a 561M-parameter (d20) model suitable for learning and experimentation. The project uses the PyTorch NGC container, FineWeb for pretraining data, SmolTalk for SFT, and Weights & Biases for logging. The default configuration trains a ~1B parameter (d24) Transformer model with FP8 precision.
## What you'll accomplish ## What you'll accomplish
You will have a working nanochat setup that trains a small LLM and serves it for chat. - **Environment:** Docker image with PyTorch NGC and nanochat dependencies on your DGX Station.
- **Training pipeline:** BPE tokenizer (65K vocab), base model pretraining with FP8, SFT, and automated report generation.
- **Environment:** Docker image with PyTorch and nanochat dependencies on your DGX Station. - **Inference:** ChatGPT-style web UI and CLI to chat with the base or SFT checkpoints.
- **Training pipeline:** Tokenizer (65K BPE), pretraining (~11.2B tokens), midtraining, SFT, and automated report generation. - **Monitoring:** W&B dashboards and `nanochat_cache/report/report.md` with metrics and samples.
- **Inference:** ChatGPT-style web UI and CLI to chat with the base, mid, or SFT checkpoints.
- **Monitoring:** W&B dashboards and `nanochat/report.md` with metrics and samples.
## What to know before starting ## What to know before starting
@ -38,104 +36,143 @@ You will have a working nanochat setup that trains a small LLM and serves it for
**Hardware:** **Hardware:**
- NVIDIA DGX Station with GB300 Ultra Superchip. - NVIDIA DGX Station with GB300 Ultra Superchip (288GB VRAM).
- Sufficient GPU memory for the chosen model (the GB300 Ultra provides ample memory for the d20 speedrun). - Adequate storage for cache (~25GB+ for FineWeb data and checkpoints).
- Adequate storage for cache (~24GB+ for FineWeb data and checkpoints).
**Software:** **Software:**
- Docker with NVIDIA Container Toolkit: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.01-py3 nvidia-smi` - Docker with NVIDIA Container Toolkit: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi`
- Network access to download datasets (Hugging Face, FineWeb) and container images. - Network access to download datasets (Hugging Face, FineWeb) and container images (nvcr.io)
- [Weights & Biases](https://wandb.ai/) account and API key. - [Weights & Biases](https://wandb.ai/) account and API key.
- [Hugging Face](https://huggingface.co/docs/hub/en/security-tokens) token for evaluation datasets. - [Hugging Face](https://huggingface.co/docs/hub/en/security-tokens) token for evaluation datasets.
## Model architecture (d24)
```
Layers: 24
Attention Heads: 12
Head Dimension: 128
Context Length: 2048 tokens
Vocabulary Size: 65,536 (2^16, trained BPE)
Precision: FP8 (e4m3, tensorwise scaling)
```
## Training stages
| Stage | Description |
|-------|-------------|
| Tokenizer | Trains BPE tokenizer (65K vocab) on ~2B characters from FineWeb |
| Base pretraining | Pretrains d24 model on FineWeb with FP8, target data:param ratio of 8 |
| SFT | Fine-tunes on synthetic identity conversations + SmolTalk |
| Report | Generates `report.md` with metrics, samples, and system info |
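
As a rough sanity check on the pretraining data budget, the arithmetic behind the shard count can be sketched in shell. This assumes a round 1B parameters for d24 (the playbook's approximate figure), 4.8 characters per token, and ~250M characters per shard (both taken from comments in the speedrun script), so treat the result as an estimate. `shards_needed` is a hypothetical helper, not part of nanochat.

```bash
# Hypothetical helper: estimate how many FineWeb shards a pretraining run needs.
# Usage: shards_needed <params> <data_param_ratio>
shards_needed() {
  local params=$1 ratio=$2
  local chars_per_token_x10=48      # assumed 4.8 chars/token, scaled by 10 for integer math
  local chars_per_shard=250000000   # assumed ~250M characters per shard
  local tokens=$(( params * ratio ))
  local chars=$(( tokens * chars_per_token_x10 / 10 ))
  # Ceiling division so a partial shard counts as a whole one
  echo $(( (chars + chars_per_shard - 1) / chars_per_shard ))
}

shards_needed 1000000000 8   # prints 154
```

Under these assumptions the estimate lands close to the roughly 150 shards cited in the speedrun script, which downloads 170 to leave some padding.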
## Ancillary files ## Ancillary files
All required assets are in the playbook directory `nvidia/station-nanochat/assets` (see the [dgx-spark-playbooks](https://github.com/NVIDIA/dgx-spark-playbooks) repository). All required assets are in `nvidia/station-nanochat/assets/`:
- `assets/Dockerfile` PyTorch NGC image plus nanochat dependencies and venv.
- `assets/setup.sh` Clones nanochat, checks out the supported commit, and builds the Docker image.
- `assets/launch.sh` Runs the training container on your DGX Station (runs the full pipeline: tokenizer, pretrain, midtrain, SFT, and report generation).
- `assets/README.md` Additional detail on training stages, inference, and troubleshooting.
- `Dockerfile` PyTorch NGC image with nanochat pip dependencies.
- `setup.sh` Clones nanochat, checks out the supported commit, copies `speedrun_station.sh`, and builds the Docker image.
- `launch.sh` Runs the training container (full pipeline: tokenizer → pretrain → SFT → report).
- `speedrun_station.sh` Modified speedrun script adapted for single-GPU DGX Station.
## Time & risk ## Time & risk
- **Estimated time:** About 30 minutes for clone, setup, and launching the run. Full d20 speedrun training time depends on your DGX Station configuration (hours to a day or more). - **Estimated time:** ~30 minutes for setup. Full d24 training takes on the order of 16+ hours on a single GB300 Ultra.
- **Risk level:** Medium - **Risk level:** Medium
- Large downloads (FineWeb) can fail or be slow; ensure stable network and disk space. - Large downloads (FineWeb) can be slow; ensure stable network and disk space.
- API keys (W&B, HF) must be set or the launch script will exit. - API keys (W&B, HF) must be set or `launch.sh` will exit immediately.
- **Rollback:** Stop containers with `docker stop`, remove caches under `~/.cache/nanochat` (or paths in `launch.sh`), and run `docker system prune -a` if needed. - **Rollback:** Stop containers with `docker stop`, remove caches, and run `docker system prune -a` if needed.
* **Last Updated:** 03/02/2026
* First Publication ## Credits
- [nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy
- [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) by HuggingFace (pretraining data)
- [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) by HuggingFace (SFT data)
## Instructions ## Instructions
## Step 1. Prerequisites and environment ## Step 1. Prerequisites and environment
This playbook is for **DGX Station** (single node). Ensure your DGX Station has Docker with NVIDIA runtime, GPU access, and required API keys. Nanochat uses Weights & Biases (W&B) for training visualization and a Hugging Face token for evaluation datasets. Ensure your DGX Station has Docker with NVIDIA runtime and GPU access. Nanochat uses Weights & Biases (W&B) for training visualization and a Hugging Face token for evaluation datasets.
```bash ```bash
## Verify GPU and Docker ## Verify GPU and Docker
nvidia-smi nvidia-smi
docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.01-py3 nvidia-smi docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi
``` ```
Expected output should show your GPU(s) and driver version. Create a [W&B account](https://wandb.ai/) and a [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens) if you do not have them. Create a [W&B account](https://wandb.ai/) and a [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens) if you don't have them. Export both keys in your shell:
```bash ```bash
export WANDB_API_KEY=<YOUR_WANDB_API_KEY> export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
export HF_TOKEN=<YOUR_HF_TOKEN> export HF_TOKEN=<YOUR_HF_TOKEN>
``` ```
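
Because `launch.sh` exits when either key is missing, it can help to check both before starting. A minimal sketch, assuming a hypothetical helper named `require_env` (it is not part of the playbook assets):

```bash
# Hypothetical preflight helper: fail early if any named variable is unset or empty.
# Usage: require_env VAR_NAME [VAR_NAME...]
require_env() {
  local name value missing=0
  for name in "$@"; do
    # Look up the variable whose name is stored in $name.
    # eval is acceptable here only because the names are fixed by the caller.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}

require_env WANDB_API_KEY HF_TOKEN || echo "Export the missing keys before running ./launch.sh"
```

The function returns 0 only when every listed variable has a non-empty value, so it can also gate other commands, for example `require_env WANDB_API_KEY HF_TOKEN && ./launch.sh`.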
## Step 2. Clone the playbook and set up nanochat ## Step 2. Clone and set up
Clone the playbook repository and run the setup script to clone the nanochat repo and build the Docker image. Clone the playbook repository and navigate to the assets directory:
```bash ```bash
git clone https://github.com/NVIDIA/dgx-spark-playbooks git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-spark-playbooks/nvidia/station-nanochat/assets cd dgx-spark-playbooks/nvidia/station-nanochat/assets
``` ```
From the `assets` directory, run the setup script. It clones [nanochat](https://github.com/karpathy/nanochat), checks out the supported commit, and builds the `nanochat` Docker image (PyTorch NGC base with tiktoken, tokenizers, datasets, wandb, etc.). Run the setup script. It clones [nanochat](https://github.com/karpathy/nanochat), checks out the supported commit, copies the station-adapted `speedrun_station.sh`, and builds the `nanochat` Docker image (PyTorch NGC base with dependencies):
```bash ```bash
./setup.sh ./setup.sh
``` ```
Setup may take several minutes while the image builds. Verify the image: You should see the `nanochat` image listed if you run `docker images`. Your directory structure after setup should look like this:
```bash ```
docker images | grep nanochat assets/
├── Dockerfile
├── launch.sh
├── setup.sh
├── speedrun_station.sh
└── nanochat/
``` ```
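
The expected layout can also be checked from the shell. This is a sketch under the assumption that the five entries listed in the tree are the only ones that matter; `check_assets_layout` is a hypothetical helper, not something `setup.sh` provides.

```bash
# Hypothetical helper: report which expected entries are missing from the assets directory.
# Usage: check_assets_layout [directory]   (defaults to the current directory)
check_assets_layout() {
  local dir=${1:-.} entry missing=0
  for entry in Dockerfile launch.sh setup.sh speedrun_station.sh nanochat; do
    if [ ! -e "$dir/$entry" ]; then
      echo "missing: $dir/$entry" >&2
      missing=1
    fi
  done
  return "$missing"
}

if check_assets_layout .; then
  echo "assets layout looks complete"
fi
```

Run it from the `assets/` directory after `./setup.sh`; any missing entry is printed to stderr and the function returns 1.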
You should see the `nanochat` image listed. ## Step 3. Launch training
## Step 3. Launch full training Ensure your API keys are exported, then launch:
> [!NOTE]
> The default `launch.sh` uses cache directories under `/nanochat_cache`. If that path does not exist on your DGX Station, edit `launch.sh` and replace those paths with your own (e.g. `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`), and create the directories before running.
To run **full** training (d20 model, 240 shards, midtraining, SFT, report) for higher-quality results, use the full launcher. On a DGX Station with GB300 Ultra this can take on the order of 16 hours:
```bash ```bash
export WANDB_API_KEY=<YOUR_WANDB_API_KEY> ./launch.sh
export HF_TOKEN=<YOUR_HF_TOKEN>
./launch_full.sh
``` ```
This runs `speedrun_full.sh` inside the container: full FineWeb download (240 shards), 561M-parameter (d20) pretraining, midtraining, supervised fine-tuning, and report generation. The training runs inside the `nanochat` container and executes the full pipeline automatically:
## Step 4. Verify and use the model 1. **Tokenizer training** — downloads ~2B characters from FineWeb, trains a 65K BPE tokenizer
2. **Base model pretraining** — downloads additional FineWeb shards, pretrains a d24 model (~1B params) with FP8
3. **SFT** — downloads synthetic identity conversations, fine-tunes for chat
4. **Report generation** — produces `report.md` with metrics and samples
After training completes, checkpoints and the tokenizer are under `~/.cache/nanochat/` (or the cache path used in `launch.sh`). Run inference from the nanochat directory (e.g. `assets/nanochat`) on your DGX Station. Training on a single GB300 Ultra takes on the order of 16+ hours for the full d24 run.
## Step 4. Monitor training
**W&B dashboard:**
Track training at [wandb.ai](https://wandb.ai/) under the `nanochat` project. The exact link to the W&B run is printed in the training logs. Key metrics:
- Training loss
- Validation BPB
- Throughput (tokens/sec)
## Step 5. Inference
After training, checkpoints are saved under the `nanochat_cache/` directory. Run inference in the `nanochat` container, through either the web UI or the CLI:
**Web UI (recommended):** **Web UI (recommended):**
```bash ```bash
cd nanochat docker run --rm --gpus all --net=host \
source ../.venv/bin/activate # if using venv from container context; otherwise use the container -v $(pwd)/nanochat:/workspace/nanochat \
python -m scripts.chat_web -v $(pwd)/nanochat_cache:/root/.cache/nanochat \
-w /workspace/nanochat \
nanochat \
python -m scripts.chat_web
``` ```
Open a browser to `http://<STATION_IP>:8000` where `<STATION_IP>` is your DGX Station's IP address. Open a browser to `http://<STATION_IP>:8000` where `<STATION_IP>` is your DGX Station's IP address.
@ -143,14 +180,15 @@ Open a browser to `http://<STATION_IP>:8000` where `<STATION_IP>` is your DGX St
**CLI:** **CLI:**
```bash ```bash
cd nanochat docker run --rm -it --gpus all \
python -m scripts.chat_cli -p "Why is the sky blue?" -v $(pwd)/nanochat:/workspace/nanochat \
python -m scripts.chat_cli -i sft -p "Write a haiku about machine learning" -v $(pwd)/nanochat_cache:/root/.cache/nanochat \
-w /workspace/nanochat \
nanochat \
python -m scripts.chat_cli -p "Why is the sky blue?"
``` ```
A full report is generated at `nanochat/report.md` after the run. You can also monitor training at [wandb.ai](https://wandb.ai/) under your project. ## Step 6. Cleanup
## Step 5. Cleanup
To stop training early, interrupt the launch script or stop the container: To stop training early, interrupt the launch script or stop the container:
@ -160,32 +198,43 @@ To stop training early, interrupt the launch script or stop the container:
```bash ```bash
## If launch.sh is running: press Ctrl+C ## If launch.sh is running: press Ctrl+C
## Or stop the container by name ## Or stop the container directly
docker stop $(docker ps -q --filter ancestor=nanochat) docker stop $(docker ps -q --filter ancestor=nanochat)
``` ```
To free disk space after training (use the same path as your cache if you set `NANOCHAT_CACHE`): To free disk space:
```bash ```bash
rm -rf ./nanochat_cache ./hf_cache rm -rf ./nanochat_cache ./hf_cache
docker system prune -a docker system prune -a
``` ```
## Step 6. Next steps and customization ## Step 7. Customization
- **Small scale run:** `./launch.sh` can run a lite training by following the customization guide to make changes to `speedrun_station.sh`. This can potentially bring down the training time. **Smaller/faster run:** Edit `speedrun_station.sh` before running setup to reduce data and model size:
- **Custom cache paths:** Set `NANOCHAT_CACHE` and `HF_CACHE` before launching (e.g. `export NANOCHAT_CACHE=/path/to/nanochat_cache`) if you want cache outside the assets directory.
- **Monitoring:** Use `nvidia-smi` and W&B dashboards to watch GPU utilization and training metrics (loss, throughput). ```bash
- **Inference:** Try the web UI and CLI with different checkpoints (`base`, `mid`, `sft`) and prompts; see sample prompts in `assets/README.md`. ## Fewer data shards (10 instead of default)
python -m nanochat.dataset -n 10 &
## Smaller model (d4 instead of d24), smaller batch size
python -m scripts.base_train --depth=4 --device-batch-size=32
```
**Batch size:** The default `--device-batch-size=64` is tuned for the GB300's 288GB VRAM. Raise it if GPU utilization is low, or lower it if training runs out of memory (OOM).
Then re-run `./setup.sh` to rebuild with the changes.
## Troubleshooting ## Troubleshooting
| Symptom | Cause | Fix | | Symptom | Cause | Fix |
|--------|--------|-----| |---------|-------|-----|
| `WANDB_API_KEY is not set` or `HF_TOKEN is not set` | Required env vars not exported before `launch.sh` | `export WANDB_API_KEY=<your_key>` and `export HF_TOKEN=<your_token>` in the same shell, then run `./launch.sh`. | | `WANDB_API_KEY is not set` or `HF_TOKEN is not set` | Required env vars not exported before `launch.sh` | `export WANDB_API_KEY=<key>` and `export HF_TOKEN=<token>` in the same shell, then re-run `./launch.sh` |
| `RuntimeError: CUDA out of memory` | Batch size or model too large for GPU | In the training script in the cloned nanochat repo (e.g. `speedrun.sh`), reduce `--device_batch_size` (e.g. `16`, `8`, `4`, `2`, or `1`). | | `RuntimeError: CUDA out of memory` | Batch size too large for available VRAM | Edit `speedrun_station.sh`: reduce `--device-batch-size` below the default of 64 (try 32, 16, or 8). Re-run `./setup.sh` then `./launch.sh` |
| Docker container not starting or no GPU | Docker or NVIDIA runtime misconfigured | Run `nvidia-smi` on your DGX Station. Check no other containers hold GPUs: `docker ps`. Test GPU in Docker: `docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi`. | | Docker container exits immediately | Missing env vars, bad cache paths, or build failure | Check logs: `docker ps -a` then `docker logs <container_id>`. Fix env vars or paths as needed |
| `Permission denied` or `No such file or directory` for cache paths in `launch.sh` | Paths like `/home/scratch.lramesh_dpt/...` don't exist on your system | Edit `launch.sh`: set cache dirs to paths you can create (e.g. `$(pwd)/nanochat_cache`, `$(pwd)/hf_cache`). Run `mkdir -p <your_cache_dirs>` and re-run `launch.sh`. | | `nanochat` image not found | Setup not run or Docker build failed | From the `assets/` directory, run `./setup.sh` and confirm with `docker images \| grep nanochat` |
| `nanochat` image not found when running `launch.sh` | Setup not run or build failed | From `nvidia/nanochat/assets`, run `./setup.sh` and confirm with `docker images` (look for the `nanochat` image). | | `No such file or directory` for cache paths | Cache directories don't exist | `launch.sh` creates them automatically under `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`. If using custom paths, create them: `mkdir -p $NANOCHAT_CACHE $HF_CACHE` |
| Training exits immediately or script doesn't wait | Container fails early (missing keys, paths, or OOM) | Check container logs: `docker ps -a` then `docker logs <container_id>`. Fix env vars, cache paths, or batch size as above. | | Training hangs at "Waiting for dataset download" | Network issue downloading FineWeb shards | Check network connectivity. The download can take time depending on bandwidth. If it persists, restart `./launch.sh` |
| Wrong cache path or "No such file" when launching | `launch.sh` uses non-existent paths (e.g. `/home/scratch...`) | On DGX Station, edit `launch.sh`: replace cache dirs with `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`, then run `mkdir -p nanochat_cache hf_cache`. | | W&B shows wrong user / stale login | Cached W&B credentials in container volume | `speedrun_station.sh` runs `wandb login --relogin` with your key automatically. Ensure `WANDB_API_KEY` is correct |
| Container runs but `launch.sh` says "Training complete!" immediately | Container failed fast and exited before the poll loop detected it | Check `docker ps -a` for the exited container and inspect logs with `docker logs <id>` |
| GPU not visible inside container | Docker NVIDIA runtime not configured | Test: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi`. If it fails, install/configure [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) |


@ -1,11 +1,15 @@
FROM nvcr.io/nvidia/pytorch:25.09-py3 FROM nvcr.io/nvidia/pytorch:26.04-py3
WORKDIR /workspace WORKDIR /workspace
# Install dependencies globally so torchrun (which uses /usr/bin/python) can access them RUN pip install \
RUN /usr/bin/python -m pip install tiktoken tokenizers datasets psutil files-to-prompt regex setuptools uvicorn wandb maturin datasets \
tokenizers \
# Create venv with --system-site-packages so it inherits global packages wandb \
RUN /usr/bin/python -m venv --system-site-packages .venv tiktoken \
psutil \
files-to-prompt \
uvicorn \
rustbpe
CMD ["/bin/bash"] CMD ["/bin/bash"]


@ -3,7 +3,6 @@
# SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
# #
# Lite training (default). Runs speedrun.sh, which setup copies from speedrun_lite.sh.
# Get wandb API key # Get wandb API key
export WANDB_API_KEY=$WANDB_API_KEY export WANDB_API_KEY=$WANDB_API_KEY
@ -11,7 +10,6 @@ if [ -z "$WANDB_API_KEY" ]; then
echo "WANDB_API_KEY is not set" echo "WANDB_API_KEY is not set"
exit 1 exit 1
fi fi
export WANDB_RUN=${WANDB_RUN:-speedrun} export WANDB_RUN=${WANDB_RUN:-speedrun}
# Get Hugging Face API key # Get Hugging Face API key
@ -21,26 +19,23 @@ if [ -z "$HF_TOKEN" ]; then
exit 1 exit 1
fi fi
# Cleanup function to stop containers # Use local cache dirs so no root paths are required
workdir=$(pwd)
NANOCHAT_CACHE="$(pwd)/nanochat_cache"
HF_CACHE="$(pwd)/hf_cache"
cleanup() { cleanup() {
echo echo -e "\nStopping training container..."
echo "Stopping containers..." docker stop $(docker ps -q --filter ancestor=nanochat) 2>/dev/null
docker stop $(docker ps -q --filter ancestor=nanochat) 2>/dev/null || true echo "Cleanup complete."
echo "Interrupted training!"
exit 0 exit 0
} }
workdir=$(pwd) trap cleanup SIGINT SIGTERM
# DGX Station: use local cache dirs so no root paths are required
NANOCHAT_CACHE="${NANOCHAT_CACHE:-$(pwd)/nanochat_cache}"
HF_CACHE="${HF_CACHE:-$(pwd)/hf_cache}"
mkdir -p "$NANOCHAT_CACHE" "$HF_CACHE"
cmd=" # Launch Nanochat training
mkdir -p /nanochat_cache && \ cmd="mkdir -p $NANOCHAT_CACHE $HF_CACHE && \
mkdir -p /hf_cache && \ chmod u+rwx $NANOCHAT_CACHE $HF_CACHE && \
chmod 777 /nanochat_cache && \
chmod 777 /hf_cache && \
docker run \ docker run \
--rm \ --rm \
--runtime=nvidia \ --runtime=nvidia \
@ -57,16 +52,8 @@ docker run \
-v $HF_CACHE:/root/.cache/huggingface \ -v $HF_CACHE:/root/.cache/huggingface \
-w /workspace/nanochat \ -w /workspace/nanochat \
nanochat \ nanochat \
bash speedrun.sh" bash runs/speedrun.sh"
sh -c "$cmd" & sh -c "$cmd" &
sleep 5 wait
while true; do echo -e "\nTraining complete!"
if ! docker ps | grep -q "nanochat"; then
echo
echo "Training complete!"
exit 0
fi
sleep 1
done
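
In shell, `wait` with no arguments returns zero regardless of how the background job exited, so a launcher that follows it with an unconditional success message can report completion after a failed run. A minimal sketch of one way to propagate the exit status, assuming a hypothetical wrapper named `run_and_report` (it is not part of `launch.sh`):

```bash
# Hypothetical wrapper: run a command in the background, wait for that specific
# process, and report success or failure based on its exit status.
run_and_report() {
  local status=0
  "$@" &
  # Waiting on a specific PID returns that process's exit status
  wait "$!" || status=$?
  if [ "$status" -eq 0 ]; then
    echo "Training complete!"
  else
    echo "Training failed with exit code $status" >&2
  fi
  return "$status"
}

run_and_report true   # prints "Training complete!"
```

In a launcher, the call would look like `run_and_report sh -c "$cmd"`, so a container that exits early surfaces as a failure instead of a completion message.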


@ -11,10 +11,10 @@ assets_dir="$(cd "$(dirname "$0")" && pwd)"
cmd="cd $workdir && \ cmd="cd $workdir && \
git clone https://github.com/karpathy/nanochat.git && \ git clone https://github.com/karpathy/nanochat.git && \
cd nanochat && \ cd nanochat && \
git checkout c6b7ab744055d5915e6ccb61088de80c10cbaff9 && \ git checkout 0aaca56805eb13f6e6e1fff789a08086902f12ab && \
cp ../speedrun_spark.sh ./speedrun.sh && \ cp ../speedrun_station.sh ./runs/speedrun.sh && \
cd .. && \ cd .. && \
chmod +x launch_full.sh 2>/dev/null || true && \ chmod +x launch.sh 2>/dev/null || true && \
docker build -t nanochat ." docker build -t nanochat ."
sh -c "$cmd" sh -c "$cmd"


@ -1,15 +1,14 @@
#!/bin/bash #!/bin/bash
set -e
# This script is the "Best ChatGPT clone that $100 can buy", # This script is configured to train your own GPT-2 grade LLM (pretraining + finetuning)
# It is designed to run in ~4 hours on 8XH100 node at $3/GPU/hour. # It is designed to run on a blank 8XH100 GPU node and takes approximately 3 hours to complete.
# 1) Example launch (simplest): # 1) Example launch (simplest):
# bash speedrun.sh # bash runs/speedrun.sh
# 2) Example launch in a screen session (because the run takes ~4 hours): # 2) Example launch in a screen session (because the run takes ~3 hours):
# screen -L -Logfile speedrun.log -S speedrun bash speedrun.sh # screen -L -Logfile runs/speedrun.log -S speedrun bash runs/speedrun.sh
# 3) Example launch with wandb logging, but see below for setting up wandb first: # 3) Example launch with wandb logging, but see below for setting up wandb first:
# WANDB_RUN=speedrun screen -L -Logfile speedrun.log -S speedrun bash speedrun.sh # WANDB_RUN=speedrun screen -L -Logfile runs/speedrun.log -S speedrun bash runs/speedrun.sh
# Default intermediate artifacts directory is in ~/.cache/nanochat # Default intermediate artifacts directory is in ~/.cache/nanochat
export OMP_NUM_THREADS=1 export OMP_NUM_THREADS=1
@ -26,7 +25,7 @@ mkdir -p $NANOCHAT_BASE_DIR
# install the repo dependencies # install the repo dependencies
# uv sync --extra gpu # uv sync --extra gpu
# activate venv so that `python` uses the project's venv instead of system python # activate venv so that `python` uses the project's venv instead of system python
source ../.venv/bin/activate # source .venv/bin/activate
# ----------------------------------------------------------------------------- # -----------------------------------------------------------------------------
# wandb setup # wandb setup
@ -49,70 +48,41 @@ python -m nanochat.report reset
# ----------------------------------------------------------------------------- # -----------------------------------------------------------------------------
# Tokenizer # Tokenizer
# Install Rust / Cargo
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source "$HOME/.cargo/env"
# Build the rustbpe Tokenizer
# unset VIRTUAL_ENV
maturin develop --release --manifest-path rustbpe/Cargo.toml
# Download the first ~2B characters of pretraining dataset # Download the first ~2B characters of pretraining dataset
# look at dev/repackage_data_reference.py for details on how this data was prepared
# each data shard is ~250M chars # each data shard is ~250M chars
# so we download 2e9 / 250e6 = 8 data shards at this point # so we download 2e9 / 250e6 = 8 data shards at this point
# each shard is ~100MB of text (compressed), so this is about ~800MB of data on disk # each shard is ~100MB of text (compressed), so this is about ~800MB of data on disk
# look at dev/repackage_data_reference.py for details on how this data was prepared
python -m nanochat.dataset -n 8 python -m nanochat.dataset -n 8
# Immediately also kick off downloading more shards in the background while tokenizer trains # Immediately also kick off downloading more shards in the background while tokenizer trains
# See comment below for why 240 is the right number here # Approximately 150 shards are needed for GPT-2 capability pretraining, add 20 for padding.
python -m nanochat.dataset -n 240 & # The maximum total number of shards available in the entire dataset is 6542.
python -m nanochat.dataset -n 170 &
DATASET_DOWNLOAD_PID=$! DATASET_DOWNLOAD_PID=$!
# train the tokenizer with vocab size 2**16 = 65536 on ~2B characters of data # train the tokenizer with vocab size 2**15 = 32768 on ~2B characters of data
python -m scripts.tok_train --max_chars=2000000000 python -m scripts.tok_train
# evaluate the tokenizer (report compression ratio etc.) # evaluate the tokenizer (report compression ratio etc.)
python -m scripts.tok_eval python -m scripts.tok_eval
# ----------------------------------------------------------------------------- # -----------------------------------------------------------------------------
# Base model (pretraining) # Base model (pretraining)
# The d20 model is 561M parameters.
# Chinchilla says #tokens = 20X #params, so we need 561e6 * 20 = 11.2B tokens.
# Assume our tokenizer is 4.8 chars/token, this is 11.2B * 4.8 ~= 54B chars.
# At 250M chars/shard, this is 54B / 250M ~= 216 shards needed for pretraining.
# Round up to 240 for safety. At ~100MB/shard, this downloads ~24GB of data to disk.
# (The total number of shards available in the entire dataset is 1822.)
echo "Waiting for dataset download to complete..." echo "Waiting for dataset download to complete..."
wait $DATASET_DOWNLOAD_PID wait $DATASET_DOWNLOAD_PID
source ../.venv/bin/activate # d24 model (slightly undertrained to beat GPT-2 => decrease data:params ratio from compute optimal 10.5 (default) to 8)
python -m scripts.base_train --depth=24 --target-param-data-ratio=8 --device-batch-size=64 --fp8 --run=$WANDB_RUN
# pretrain the d20 model # evaluate the model: CORE metric, BPB on train/val, and draw samples
python -m scripts.base_train --depth=20 --run=$WANDB_RUN python -m scripts.base_eval --device-batch-size=64
# evaluate the model on a larger chunk of train/val data and draw some samples
python -m scripts.base_loss
# evaluate the model on CORE tasks
python -m scripts.base_eval
sleep 5
# ----------------------------------------------------------------------------- # -----------------------------------------------------------------------------
# Midtraining (teach the model conversation special tokens, tool use, multiple choice) # SFT (teach the model conversation special tokens, tool use, multiple choice)
# download 2.3MB of synthetic identity conversations to impart a personality to nanochat # download 2.3MB of synthetic identity conversations to impart a personality to nanochat
# see dev/gen_sft_data.py for details on how this data was prepared and to get a sense of how you can easily tune it # see dev/gen_synthetic_data.py for details on how this data was prepared and to get a sense of how you can easily tune it
curl -L -o $NANOCHAT_BASE_DIR/identity_conversations.jsonl https://karpathy-public.s3.us-west-2.amazonaws.com/identity_conversations.jsonl curl -L -o $NANOCHAT_BASE_DIR/identity_conversations.jsonl https://karpathy-public.s3.us-west-2.amazonaws.com/identity_conversations.jsonl
# run midtraining and eval the model # run SFT and eval the model
python -m scripts.mid_train --run=$WANDB_RUN python -m scripts.chat_sft --device-batch-size=64 --run=$WANDB_RUN
python -m scripts.chat_eval -i mid
sleep 5
# -----------------------------------------------------------------------------
# Supervised Finetuning (domain adaptation to each sequence all by itself per row)
# train sft and re-eval right away (should see a small bump)
python -m scripts.chat_sft --run=$WANDB_RUN
python -m scripts.chat_eval -i sft python -m scripts.chat_eval -i sft
# chat with the model over CLI! Leave out the -p to chat interactively # chat with the model over CLI! Leave out the -p to chat interactively
@ -121,15 +91,6 @@ python -m scripts.chat_eval -i sft
# even better, chat with your model over a pretty WebUI ChatGPT style # even better, chat with your model over a pretty WebUI ChatGPT style
# python -m scripts.chat_web # python -m scripts.chat_web
# -----------------------------------------------------------------------------
# Reinforcement Learning. Optional, and currently only on GSM8K
# (optional)
# run reinforcement learning
# python -m scripts.chat_rl --run=$WANDB_RUN
# eval the RL model only on GSM8K
# python -m scripts.chat_eval --i rl -a GSM8K
# ----------------------------------------------------------------------------- # -----------------------------------------------------------------------------
# Generate the full report by putting together all the sections # Generate the full report by putting together all the sections
# report.md is the output and will be copied to current directory for convenience # report.md is the output and will be copied to current directory for convenience


@ -45,18 +45,16 @@ spec:
content: | content: |
# Basic idea # Basic idea
This playbook demonstrates training of [nanochat](https://github.com/karpathy/nanochat) on DGX Station with the GB300 Ultra Superchip. You run the full pipeline on a single system: custom BPE tokenizer training, base model pretraining, midtraining (conversation format), supervised fine-tuning (SFT), and inference via CLI or web UI. This playbook demonstrates training of [nanochat](https://github.com/karpathy/nanochat) on DGX Station with the GB300 Ultra Superchip. You run the full pipeline on a single system: custom BPE tokenizer training, base model pretraining, supervised fine-tuning (SFT), and inference via CLI or web UI.
The project uses the PyTorch NGC container, FineWeb for pretraining, SmolTalk for SFT, and Weights & Biases for logging. The default speedrun configuration trains a 561M-parameter (d20) model suitable for learning and experimentation. The project uses the PyTorch NGC container, FineWeb for pretraining data, SmolTalk for SFT, and Weights & Biases for logging. The default configuration trains a ~1B parameter (d24) Transformer model with FP8 precision.
# What you'll accomplish # What you'll accomplish
You will have a working nanochat setup that trains a small LLM and serves it for chat. - **Environment:** Docker image with PyTorch NGC and nanochat dependencies on your DGX Station.
- **Training pipeline:** BPE tokenizer (65K vocab), base model pretraining with FP8, SFT, and automated report generation.
- **Environment:** Docker image with PyTorch and nanochat dependencies on your DGX Station. - **Inference:** ChatGPT-style web UI and CLI to chat with the base or SFT checkpoints.
- **Training pipeline:** Tokenizer (65K BPE), pretraining (~11.2B tokens), midtraining, SFT, and automated report generation. - **Monitoring:** W&B dashboards and `nanochat_cache/report/report.md` with metrics and samples.
- **Inference:** ChatGPT-style web UI and CLI to chat with the base, mid, or SFT checkpoints.
- **Monitoring:** W&B dashboards and `nanochat/report.md` with metrics and samples.
# What to know before starting # What to know before starting
@ -68,36 +66,58 @@ spec:
**Hardware:** **Hardware:**
- NVIDIA DGX Station with GB300 Ultra Superchip. - NVIDIA DGX Station with GB300 Ultra Superchip (288GB VRAM).
- Sufficient GPU memory for the chosen model (the GB300 Ultra provides ample memory for the d20 speedrun). - Adequate storage for cache (~25GB+ for FineWeb data and checkpoints).
- Adequate storage for cache (~24GB+ for FineWeb data and checkpoints).
**Software:** **Software:**
- Docker with NVIDIA Container Toolkit: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.01-py3 nvidia-smi` - Docker with NVIDIA Container Toolkit: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi`
- Network access to download datasets (Hugging Face, FineWeb) and container images. - Network access to download datasets (Hugging Face, FineWeb) and container images (nvcr.io).
- [Weights & Biases](https://wandb.ai/) account and API key. - [Weights & Biases](https://wandb.ai/) account and API key.
- [Hugging Face](https://huggingface.co/docs/hub/en/security-tokens) token for evaluation datasets. - [Hugging Face](https://huggingface.co/docs/hub/en/security-tokens) token for evaluation datasets.
# Model architecture (d24)
```
Layers: 24
Attention Heads: 12
Head Dimension: 128
Context Length: 2048 tokens
Vocabulary Size: 65,536 (2^16, trained BPE)
Precision: FP8 (e4m3, tensorwise scaling)
```
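As a quick sanity check on the figures above, the model width can be derived from the head count and head dimension. This is only a sketch: it assumes the usual multi-head attention convention (width = heads × head dimension), and the variable names are illustrative rather than taken from nanochat.

```shell
# Illustrative arithmetic only; variable names are hypothetical.
heads=12        # Attention Heads
head_dim=128    # Head Dimension
vocab_bits=16   # Vocabulary Size is stated as 2^16

# Assumption: model width = heads * head_dim (standard multi-head attention).
model_dim=$((heads * head_dim))
vocab_size=$((1 << vocab_bits))

echo "model_dim=$model_dim vocab_size=$vocab_size"
```

Under that assumption the width comes out to 1536, and the vocabulary size matches the 65,536 listed above.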
# Training stages
| Stage | Description |
|-------|-------------|
| Tokenizer | Trains BPE tokenizer (65K vocab) on ~2B characters from FineWeb |
| Base pretraining | Pretrains d24 model on FineWeb with FP8, target data:param ratio of 8 |
| SFT | Fine-tunes on synthetic identity conversations + SmolTalk |
| Report | Generates `report.md` with metrics, samples, and system info |
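The data:param ratio in the table implies a rough token budget for pretraining. The sketch below assumes a round 1,000 million parameters for the "~1B" model; the parameter count nanochat actually uses for this ratio may differ, so treat the result as an order-of-magnitude estimate, not a measured value.

```shell
# Work in millions to stay well inside shell integer limits.
params_millions=1000   # assumption: "~1B" rounded to 1,000 million
ratio=8                # target data:param ratio from the table

tokens_millions=$((params_millions * ratio))
echo "approx pretraining tokens: ${tokens_millions} million"
```

With these inputs the estimate is about 8,000 million (8B) tokens.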
# Ancillary files # Ancillary files
All required assets are in the playbook directory `nvidia/station-nanochat/assets` (see the [dgx-spark-playbooks](https://github.com/NVIDIA/dgx-spark-playbooks) repository). All required assets are in `nvidia/station-nanochat/assets/`:
- `assets/Dockerfile`: PyTorch NGC image plus nanochat dependencies and venv.
- `assets/setup.sh`: Clones nanochat, checks out the supported commit, and builds the Docker image.
- `assets/launch.sh`: Runs the training container on your DGX Station (runs the full pipeline: tokenizer, pretrain, midtrain, SFT, and report generation).
- `assets/README.md`: Additional detail on training stages, inference, and troubleshooting.
- `Dockerfile`: PyTorch NGC image with nanochat pip dependencies.
- `setup.sh`: Clones nanochat, checks out the supported commit, copies `speedrun_station.sh`, and builds the Docker image.
- `launch.sh`: Runs the training container (full pipeline: tokenizer → pretrain → SFT → report).
- `speedrun_station.sh`: Modified speedrun script adapted for single-GPU DGX Station.
# Time & risk # Time & risk
- **Estimated time:** About 30 minutes for clone, setup, and launching the run. Full d20 speedrun training time depends on your DGX Station configuration (hours to a day or more). - **Estimated time:** ~30 minutes for setup. Full d24 training takes on the order of 16+ hours on a single GB300 Ultra.
- **Risk level:** Medium - **Risk level:** Medium
- Large downloads (FineWeb) can fail or be slow; ensure stable network and disk space. - Large downloads (FineWeb) can be slow; ensure stable network and disk space.
- API keys (W&B, HF) must be set or the launch script will exit. - API keys (W&B, HF) must be set or `launch.sh` will exit immediately.
- **Rollback:** Stop containers with `docker stop`, remove caches under `~/.cache/nanochat` (or paths in `launch.sh`), and run `docker system prune -a` if needed. - **Rollback:** Stop containers with `docker stop`, remove caches, and run `docker system prune -a` if needed.
* **Last Updated:** 03/02/2026
* First Publication # Credits
- [nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy
- [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) by HuggingFace (pretraining data)
- [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) by HuggingFace (SFT data)
@ -108,69 +128,86 @@ spec:
content: | content: |
# Step 1. Prerequisites and environment # Step 1. Prerequisites and environment
This playbook is for **DGX Station** (single node). Ensure your DGX Station has Docker with NVIDIA runtime, GPU access, and required API keys. Nanochat uses Weights & Biases (W&B) for training visualization and a Hugging Face token for evaluation datasets. Ensure your DGX Station has Docker with NVIDIA runtime and GPU access. Nanochat uses Weights & Biases (W&B) for training visualization and a Hugging Face token for evaluation datasets.
```bash ```bash
# Verify GPU and Docker # Verify GPU and Docker
nvidia-smi nvidia-smi
docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.01-py3 nvidia-smi docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi
``` ```
Expected output should show your GPU(s) and driver version. Create a [W&B account](https://wandb.ai/) and a [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens) if you do not have them. Create a [W&B account](https://wandb.ai/) and a [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens) if you don't have them. Export both keys in your shell:
```bash ```bash
export WANDB_API_KEY=<YOUR_WANDB_API_KEY> export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
export HF_TOKEN=<YOUR_HF_TOKEN> export HF_TOKEN=<YOUR_HF_TOKEN>
``` ```
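Because the launch script exits when either key is missing, it can help to confirm that both variables are exported before starting a long run. The helper below is a hypothetical sketch (`missing_env` is not part of `launch.sh` or nanochat); it relies on `printenv`, which only sees exported variables.

```shell
# Hypothetical helper, not part of launch.sh: list required variables that
# are missing from the exported environment.
missing_env() {
  missing=""
  for name in "$@"; do
    # printenv only reports exported variables, i.e. the ones a child
    # process such as a launch script would actually inherit.
    if [ -z "$(printenv "$name")" ]; then
      missing="$missing $name"
    fi
  done
  # Print the missing names, space separated, without a leading space.
  echo "${missing# }"
  # Succeed only when nothing is missing.
  [ -z "$missing" ]
}

missing_env WANDB_API_KEY HF_TOKEN \
  || echo "Export the variables listed above before running ./launch.sh"
```

A variable that is set in the shell but not exported is reported as missing, which matches the "same shell" advice in the troubleshooting table.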
# Step 2. Clone the playbook and set up nanochat # Step 2. Clone and set up
Clone the playbook repository and run the setup script to clone the nanochat repo and build the Docker image. Clone the playbook repository and navigate to the assets directory:
```bash ```bash
git clone https://github.com/NVIDIA/dgx-spark-playbooks git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-spark-playbooks/nvidia/station-nanochat/assets cd dgx-spark-playbooks/nvidia/station-nanochat/assets
``` ```
From the `assets` directory, run the setup script. It clones [nanochat](https://github.com/karpathy/nanochat), checks out the supported commit, and builds the `nanochat` Docker image (PyTorch NGC base with tiktoken, tokenizers, datasets, wandb, etc.). Run the setup script. It clones [nanochat](https://github.com/karpathy/nanochat), checks out the supported commit, copies the station-adapted `speedrun_station.sh`, and builds the `nanochat` Docker image (PyTorch NGC base with dependencies):
```bash ```bash
./setup.sh ./setup.sh
``` ```
Setup may take several minutes while the image builds. Verify the image: You should see the `nanochat` image listed if you run `docker images`. Your directory structure after setup should look like this:
```bash ```
docker images | grep nanochat assets/
├── Dockerfile
├── launch.sh
├── setup.sh
├── speedrun_station.sh
└── nanochat/
``` ```
You should see the `nanochat` image listed. # Step 3. Launch training
# Step 3. Launch full training Ensure your API keys are exported, then launch:
> [!NOTE]
> The default `launch.sh` uses cache directories under `/nanochat_cache`. If that path does not exist on your DGX Station, edit `launch.sh` and replace those paths with your own (e.g. `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`), and create the directories before running.
To run **full** training (d20 model, 240 shards, midtraining, SFT, report) for higher-quality results, use the full launcher. On a DGX Station with GB300 Ultra this can take on the order of 16 hours:
```bash ```bash
export WANDB_API_KEY=<YOUR_WANDB_API_KEY> ./launch.sh
export HF_TOKEN=<YOUR_HF_TOKEN>
./launch_full.sh
``` ```
This runs `speedrun_full.sh` inside the container: full FineWeb download (240 shards), 561M-parameter (d20) pretraining, midtraining, supervised fine-tuning, and report generation. The training runs inside the `nanochat` container and executes the full pipeline automatically:
# Step 4. Verify and use the model 1. **Tokenizer training** — downloads ~2B characters from FineWeb, trains a 65K BPE tokenizer
2. **Base model pretraining** — downloads additional FineWeb shards, pretrains a d24 model (~1B params) with FP8
3. **SFT** — downloads synthetic identity conversations, fine-tunes for chat
4. **Report generation** — produces `report.md` with metrics and samples
After training completes, checkpoints and the tokenizer are under `~/.cache/nanochat/` (or the cache path used in `launch.sh`). Run inference from the nanochat directory (e.g. `assets/nanochat`) on your DGX Station. Training on a single GB300 Ultra takes on the order of 16+ hours for the full d24 run.
# Step 4. Monitor training
**W&B dashboard:**
Track training at [wandb.ai](https://wandb.ai/) under the `nanochat` project. The exact link to the W&B run appears in the training logs. Key metrics:
- Training loss
- Validation BPB
- Throughput (tokens/sec)
# Step 5. Inference
After training, checkpoints are saved under the `nanochat_cache/` directory. Run inference from inside the container or interactively:
**Web UI (recommended):** **Web UI (recommended):**
```bash ```bash
cd nanochat docker run --rm --gpus all --net=host \
source ../.venv/bin/activate # if using venv from container context; otherwise use the container -v $(pwd)/nanochat:/workspace/nanochat \
python -m scripts.chat_web -v $(pwd)/nanochat_cache:/root/.cache/nanochat \
-w /workspace/nanochat \
nanochat \
python -m scripts.chat_web
``` ```
Open a browser to `http://<STATION_IP>:8000` where `<STATION_IP>` is your DGX Station's IP address. Open a browser to `http://<STATION_IP>:8000` where `<STATION_IP>` is your DGX Station's IP address.
@ -178,14 +215,15 @@ spec:
**CLI:** **CLI:**
```bash ```bash
cd nanochat docker run --rm -it --gpus all \
python -m scripts.chat_cli -p "Why is the sky blue?" -v $(pwd)/nanochat:/workspace/nanochat \
python -m scripts.chat_cli -i sft -p "Write a haiku about machine learning" -v $(pwd)/nanochat_cache:/root/.cache/nanochat \
-w /workspace/nanochat \
nanochat \
python -m scripts.chat_cli -p "Why is the sky blue?"
``` ```
A full report is generated at `nanochat/report.md` after the run. You can also monitor training at [wandb.ai](https://wandb.ai/) under your project. # Step 6. Cleanup
# Step 5. Cleanup
To stop training early, interrupt the launch script or stop the container: To stop training early, interrupt the launch script or stop the container:
@ -195,23 +233,32 @@ spec:
```bash ```bash
# If launch.sh is running: press Ctrl+C # If launch.sh is running: press Ctrl+C
# Or stop the container by name # Or stop the container directly
docker stop $(docker ps -q --filter ancestor=nanochat) docker stop $(docker ps -q --filter ancestor=nanochat)
``` ```
To free disk space after training (use the same path as your cache if you set `NANOCHAT_CACHE`): To free disk space:
```bash ```bash
rm -rf ./nanochat_cache ./hf_cache rm -rf ./nanochat_cache ./hf_cache
docker system prune -a docker system prune -a
``` ```
# Step 6. Next steps and customization # Step 7. Customization
- **Small scale run:** `./launch.sh` can run a lite training by following the customization guide to make changes to `speedrun_station.sh`. This can potentially bring down the training time. **Smaller/faster run:** Edit `speedrun_station.sh` before running setup to reduce data and model size:
- **Custom cache paths:** Set `NANOCHAT_CACHE` and `HF_CACHE` before launching (e.g. `export NANOCHAT_CACHE=/path/to/nanochat_cache`) if you want cache outside the assets directory.
- **Monitoring:** Use `nvidia-smi` and W&B dashboards to watch GPU utilization and training metrics (loss, throughput). ```bash
- **Inference:** Try the web UI and CLI with different checkpoints (`base`, `mid`, `sft`) and prompts; see sample prompts in `assets/README.md`. # Fewer data shards (10 instead of default)
python -m nanochat.dataset -n 10 &
# Smaller model (d4 instead of d24), smaller batch size
python -m scripts.base_train --depth=4 --device-batch-size=32
```
**Batch size:** The default `--device-batch-size=64` is tuned for the GB300's 288GB VRAM. Increase it if GPU utilization is low, or decrease it if training runs out of memory (OOM).
Then re-run `./setup.sh` to rebuild with the changes.
@ -221,14 +268,16 @@ spec:
label: Troubleshooting label: Troubleshooting
content: | content: |
| Symptom | Cause | Fix | | Symptom | Cause | Fix |
|--------|--------|-----| |---------|-------|-----|
| `WANDB_API_KEY is not set` or `HF_TOKEN is not set` | Required env vars not exported before `launch.sh` | `export WANDB_API_KEY=<your_key>` and `export HF_TOKEN=<your_token>` in the same shell, then run `./launch.sh`. | | `WANDB_API_KEY is not set` or `HF_TOKEN is not set` | Required env vars not exported before `launch.sh` | `export WANDB_API_KEY=<key>` and `export HF_TOKEN=<token>` in the same shell, then re-run `./launch.sh` |
| `RuntimeError: CUDA out of memory` | Batch size or model too large for GPU | In the training script in the cloned nanochat repo (e.g. `speedrun.sh`), reduce `--device_batch_size` (e.g. `16`, `8`, `4`, `2`, or `1`). | | `RuntimeError: CUDA out of memory` | Batch size too large for available VRAM | Edit `speedrun_station.sh`: reduce `--device-batch-size` below the default of 64 (try 32, 16, or 8). Re-run `./setup.sh` then `./launch.sh` |
| Docker container not starting or no GPU | Docker or NVIDIA runtime misconfigured | Run `nvidia-smi` on your DGX Station. Check no other containers hold GPUs: `docker ps`. Test GPU in Docker: `docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi`. | | Docker container exits immediately | Missing env vars, bad cache paths, or build failure | Check logs: `docker ps -a` then `docker logs <container_id>`. Fix env vars or paths as needed |
| `Permission denied` or `No such file or directory` for cache paths in `launch.sh` | Paths like `/home/scratch.lramesh_dpt/...` don't exist on your system | Edit `launch.sh`: set cache dirs to paths you can create (e.g. `$(pwd)/nanochat_cache`, `$(pwd)/hf_cache`). Run `mkdir -p <your_cache_dirs>` and re-run `launch.sh`. | | `nanochat` image not found | Setup not run or Docker build failed | From the `assets/` directory, run `./setup.sh` and confirm with `docker images \| grep nanochat` |
| `nanochat` image not found when running `launch.sh` | Setup not run or build failed | From `nvidia/nanochat/assets`, run `./setup.sh` and confirm with `docker images` (look for the `nanochat` image). | | `No such file or directory` for cache paths | Cache directories don't exist | `launch.sh` creates them automatically under `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`. If using custom paths, create them: `mkdir -p $NANOCHAT_CACHE $HF_CACHE` |
| Training exits immediately or script doesn't wait | Container fails early (missing keys, paths, or OOM) | Check container logs: `docker ps -a` then `docker logs <container_id>`. Fix env vars, cache paths, or batch size as above. | | Training hangs at "Waiting for dataset download" | Network issue downloading FineWeb shards | Check network connectivity. The download can take time depending on bandwidth. If it persists, restart `./launch.sh` |
| Wrong cache path or "No such file" when launching | `launch.sh` uses non-existent paths (e.g. `/home/scratch...`) | On DGX Station, edit `launch.sh`: replace cache dirs with `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`, then run `mkdir -p nanochat_cache hf_cache`. | | W&B shows wrong user / stale login | Cached W&B credentials in container volume | `speedrun_station.sh` runs `wandb login --relogin` with your key automatically. Ensure `WANDB_API_KEY` is correct |
| Container runs but `launch.sh` says "Training complete!" immediately | Container failed fast and exited before the poll loop detected it | Check `docker ps -a` for the exited container and inspect logs with `docker logs <id>` |
| GPU not visible inside container | Docker NVIDIA runtime not configured | Test: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi`. If it fails, install/configure [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) |


@ -87,7 +87,7 @@ spec:
- NVIDIA DGX Station with Blackwell architecture GPU (GB300 chip) - NVIDIA DGX Station with Blackwell architecture GPU (GB300 chip)
- Docker installed with GPU support - Docker installed with GPU support
- NVIDIA Container Toolkit configured - NVIDIA Container Toolkit configured
- Megatron-Bridge installed (via the the NeMo Framework NGC container) - Megatron-Bridge installed (via the NeMo Framework NGC container)
Verify your setup: Verify your setup:
@ -139,7 +139,7 @@ spec:
nvcr.io/nvidia/nemo:${TAG} nvcr.io/nvidia/nemo:${TAG}
``` ```
All subsequent `torchrun` / `python` commands in this playbook are meant to be executed **from the shell inside this container** . All subsequent `torchrun` / `python` commands in this playbook are meant to be executed **from the shell inside this container**.
# Step 2. Review the pretraining script # Step 2. Review the pretraining script
@ -279,7 +279,7 @@ spec:
| `CUDA out of memory` during model init | Insufficient GPU memory for Llama 3.1 8B + optimizer states | Reduce `micro_batch_size` or use `--nproc_per_node` for model parallelism | | `CUDA out of memory` during model init | Insufficient GPU memory for Llama 3.1 8B + optimizer states | Reduce `micro_batch_size` or use `--nproc_per_node` for model parallelism |
| `torchrun` hangs or times out | NCCL communication failure between GPUs | Check `NCCL_DEBUG=INFO torchrun ...` for details; verify all GPUs are visible | | `torchrun` hangs or times out | NCCL communication failure between GPUs | Check `NCCL_DEBUG=INFO torchrun ...` for details; verify all GPUs are visible |
| Training loss is NaN | Precision instability | Increase `num_layers_at_end_in_bf16` (e.g., from 4 to 8) or reduce learning rate | | Training loss is NaN | Precision instability | Increase `num_layers_at_end_in_bf16` (e.g., from 4 to 8) or reduce learning rate |
| `--no-fp4` works but NVFP4 crashes | Transformer Engine version mismatch | Ensure Transformer Engine supports NVFP4; update with `pip install --upgrade transformer-engine` | | `--disable-fp4` works but NVFP4 crashes | Transformer Engine version mismatch | Ensure Transformer Engine supports NVFP4; update with `pip install --upgrade transformer-engine` |
| Slow training throughput | Not using Tensor Cores efficiently | Ensure batch dimensions are multiples of 8; check that `nvidia-smi` shows high GPU utilization | | Slow training throughput | Not using Tensor Cores efficiently | Ensure batch dimensions are multiples of 8; check that `nvidia-smi` shows high GPU utilization |
| Permission denied on Docker | User not in docker group | Run `sudo usermod -aG docker $USER && newgrp docker` | | Permission denied on Docker | User not in docker group | Run `sudo usermod -aG docker $USER && newgrp docker` |


@ -330,7 +330,7 @@ spec:
```bash ```bash
git clone https://github.com/NVIDIA/dgx-spark-playbooks git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-station-playbooks/nvidia/station-sglang-inference cd dgx-spark-playbooks/nvidia/station-sglang-inference
``` ```
> [!TIP] > [!TIP]


@ -1,8 +1,8 @@
kind: Playbook kind: Playbook
metadata: metadata:
name: station-vllm name: station-vllm
displayName: Serve Qwen3-235B with vLLM displayName: vLLM for Inference
shortDescription: Set up vLLM server with Qwen3-235B on DGX Station shortDescription: Install and use vLLM on DGX Station
publisher: nvidia publisher: nvidia
description: | description: |
# REPLACE THIS WITH YOUR MODEL CARD # REPLACE THIS WITH YOUR MODEL CARD
@ -15,7 +15,7 @@ metadata:
attributes: attributes:
- key: DURATION - key: DURATION
value: 20 MIN value: 30 MIN
spec: spec:
artifactName: station-vllm artifactName: station-vllm
@ -42,7 +42,9 @@ spec:
# What you'll accomplish # What you'll accomplish
Serve the **Qwen3-235B-A22B-NVFP4** model using vLLM on NVIDIA DGX Station. This 235B parameter model uses NVFP4 quantization and fits entirely in VRAM on the GB300 GPU. Serve a **supported model** using vLLM on NVIDIA DGX Station. Refer to the table below to see the supported models.
You'll set up vLLM high-throughput LLM serving on NVIDIA DGX Station with Blackwell architecture.
# What to know before starting # What to know before starting
@ -57,21 +59,30 @@ spec:
- HuggingFace account with access token - HuggingFace account with access token
- Network access to NGC and HuggingFace - Network access to NGC and HuggingFace
# Model Support Matrix
The following models are supported with vLLM on DGX Station. All listed models are available and ready to use:
| Model | Quantization | Support Status | HF Handle |
|-------|-------------|----------------|-----------|
| **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) |
| **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) |
| **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) |
# Time & risk # Time & risk
* **Duration:** 15-20 minutes (longer on first run due to model download) * **Duration:** 30 minutes (longer on first run due to model download)
* **Risks:** Model download requires HuggingFace authentication * **Risks:** Model download requires HuggingFace authentication
* **Rollback:** Stop and remove the container to restore state * **Rollback:** Stop and remove the container to restore state
* **Last Updated:** 03/02/2026 * **Last Updated:** 05/28/2026
* First Publication * Update models
- -
id: instructions id: instructions
label: Serve Qwen3-235B label: Instructions
content: | content: |
# Step 1. Set up Docker permissions # Step 1. Set up Docker permissions
@ -92,7 +103,7 @@ spec:
export HF_TOKEN="your_huggingface_token" export HF_TOKEN="your_huggingface_token"
# Model to serve # Model to serve
export MODEL_HANDLE="nvidia/Qwen3-235B-A22B-NVFP4" export MODEL_HANDLE="<HF_HANDLE>"
# Maximum context length # Maximum context length
export MAX_MODEL_LEN=8192 export MAX_MODEL_LEN=8192
@ -106,9 +117,16 @@ spec:
docker pull nvcr.io/nvidia/vllm:26.01-py3 docker pull nvcr.io/nvidia/vllm:26.01-py3
``` ```
For Step-3.7-Flash models, pull the custom vLLM container:
```bash
docker pull vllm/vllm-openai:stepfun37
```
# Step 4. Start vLLM server # Step 4. Start vLLM server
Start the vLLM server with the Qwen3-235B model. This model fits entirely in VRAM on the GB300. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`. Start the vLLM server with the model. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`.
For the Qwen3-235B NVFP4 model, run with the NGC container. This model fits entirely in VRAM on the GB300.
```bash ```bash
docker run -d \ docker run -d \
@ -126,6 +144,28 @@ spec:
--gpu-memory-utilization 0.9 --gpu-memory-utilization 0.9
``` ```
For Step-3.7-Flash models, run with the custom vLLM container. Both the FP8 and NVFP4 versions fit entirely in VRAM on the GB300.
```bash
docker run -d \
--name vllm-server \
--gpus all \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 8000:8000 \
-e HF_TOKEN="$HF_TOKEN" \
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
vllm/vllm-openai:stepfun37 \
"$MODEL_HANDLE" \
--gpu-memory-utilization 0.95 \
--trust-remote-code \
--reasoning-parser step3p5 \
--enable-auto-tool-choice \
--tool-call-parser step3p5 \
--kv-cache-dtype fp8
```
Check the server logs for startup progress: Check the server logs for startup progress:
```bash ```bash
@ -135,7 +175,7 @@ spec:
Expected output includes: Expected output includes:
- Model download progress (first run only) - Model download progress (first run only)
- Model loading into GPU memory - Model loading into GPU memory
- `Uvicorn running on http://0.0.0.0:8000` - `Application startup complete.`
Press `Ctrl+C` to exit log view once the server is ready. Press `Ctrl+C` to exit log view once the server is ready.
@ -166,9 +206,10 @@ spec:
Optionally, remove the image and cached model: Optionally, remove the image and cached model:
For example:
```bash ```bash
docker rmi nvcr.io/nvidia/vllm:26.01-py3 docker rmi "<docker image name>"
rm -rf $HOME/.cache/huggingface/hub/models--nvidia--Qwen3-235B-A22B-NVFP4 rm -rf $HOME/.cache/huggingface/hub/"<downloaded model name>"
``` ```


@ -1,8 +1,8 @@
kind: Playbook kind: Playbook
metadata: metadata:
name: station-vllm name: station-vllm
displayName: vLLM for Inference displayName: Serve Qwen3-235B with vLLM
shortDescription: Install and use vLLM on DGX Station shortDescription: Set up vLLM server with Qwen3-235B on DGX Station
publisher: nvidia publisher: nvidia
description: | description: |
# REPLACE THIS WITH YOUR MODEL CARD # REPLACE THIS WITH YOUR MODEL CARD
@ -15,7 +15,7 @@ metadata:
attributes: attributes:
- key: DURATION - key: DURATION
value: 30 MIN value: 20 MIN
spec: spec:
artifactName: station-vllm artifactName: station-vllm
@ -42,9 +42,7 @@ spec:
# What you'll accomplish # What you'll accomplish
Serve a **supported model** using vLLM on NVIDIA DGX Station. Refer to the table below to see the supported models. Serve the **Qwen3-235B-A22B-NVFP4** model using vLLM on NVIDIA DGX Station. This 235B parameter model uses NVFP4 quantization and fits entirely in VRAM on the GB300 GPU.
You'll set up vLLM high-throughput LLM serving on NVIDIA DGX Station with Blackwell architecture.
# What to know before starting # What to know before starting
@ -59,30 +57,21 @@ spec:
- HuggingFace account with access token - HuggingFace account with access token
- Network access to NGC and HuggingFace - Network access to NGC and HuggingFace
# Model Support Matrix
The following models are supported with vLLM on Spark. All listed models are available and ready to use:
| Model | Quantization | Support Status | HF Handle |
|-------|-------------|----------------|-----------|
| **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) |
| **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) |
| **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) |
# Time & risk # Time & risk
* **Duration:** 30 minutes (longer on first run due to model download) * **Duration:** 15-20 minutes (longer on first run due to model download)
* **Risks:** Model download requires HuggingFace authentication * **Risks:** Model download requires HuggingFace authentication
* **Rollback:** Stop and remove the container to restore state * **Rollback:** Stop and remove the container to restore state
* **Last Updated:** 05/28/2026 * **Last Updated:** 03/02/2026
* Update models * First Publication
- -
id: instructions id: instructions
label: Instructions label: Serve Qwen3-235B
content: | content: |
# Step 1. Set up Docker permissions # Step 1. Set up Docker permissions
@ -103,7 +92,7 @@ spec:
export HF_TOKEN="your_huggingface_token" export HF_TOKEN="your_huggingface_token"
# Model to serve # Model to serve
export MODEL_HANDLE="<HF_HANDLE>" export MODEL_HANDLE="nvidia/Qwen3-235B-A22B-NVFP4"
# Maximum context length # Maximum context length
export MAX_MODEL_LEN=8192 export MAX_MODEL_LEN=8192
@ -117,16 +106,9 @@ spec:
docker pull nvcr.io/nvidia/vllm:26.01-py3 docker pull nvcr.io/nvidia/vllm:26.01-py3
``` ```
For Step-3.7-Flash models, pull the custom vLLM container:
```bash
docker pull vllm/vllm-openai:stepfun37
```
# Step 4. Start vLLM server # Step 4. Start vLLM server
Start the vLLM server with the model. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`. Start the vLLM server with the Qwen3-235B model. This model fits entirely in VRAM on the GB300. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`.
For the Qwen3-235B NVFP4 model, run with the NGC container. This model fits entirely in VRAM on the GB300.
```bash ```bash
docker run -d \ docker run -d \
@ -144,28 +126,6 @@ spec:
--gpu-memory-utilization 0.9 --gpu-memory-utilization 0.9
``` ```
For Step-3.7-Flash models, run with the custom vLLM container. Both the FP8 and NVFP4 versions fit entirely in VRAM on the GB300.
```bash
docker run -d \
--name vllm-server \
--gpus all \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 8000:8000 \
-e HF_TOKEN="$HF_TOKEN" \
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
vllm/vllm-openai:stepfun37 \
"$MODEL_HANDLE" \
--gpu-memory-utilization 0.95 \
--trust-remote-code \
--reasoning-parser step3p5 \
--enable-auto-tool-choice \
--tool-call-parser step3p5 \
--kv-cache-dtype fp8
```
Check the server logs for startup progress: Check the server logs for startup progress:
```bash ```bash
@ -175,7 +135,7 @@ spec:
Expected output includes: Expected output includes:
- Model download progress (first run only) - Model download progress (first run only)
- Model loading into GPU memory - Model loading into GPU memory
- `Application startup complete.` - `Uvicorn running on http://0.0.0.0:8000`
Press `Ctrl+C` to exit log view once the server is ready. Press `Ctrl+C` to exit log view once the server is ready.
@ -206,10 +166,9 @@ spec:
Optionally, remove the image and cached model: Optionally, remove the image and cached model:
For example:
```bash ```bash
docker rmi "<docker image name>" docker rmi nvcr.io/nvidia/vllm:26.01-py3
rm -rf $HOME/.cache/huggingface/hub/"<downloaded model name>" rm -rf $HOME/.cache/huggingface/hub/models--nvidia--Qwen3-235B-A22B-NVFP4
``` ```