chore: Regenerate all playbooks

GitLab CI 2026-05-29 15:56:45 +00:00
parent a9383bb067
commit 6942395d72
10 changed files with 362 additions and 312 deletions


@@ -15,18 +15,16 @@
## Basic idea
This playbook demonstrates training of [nanochat](https://github.com/karpathy/nanochat) on DGX Station with the GB300 Ultra Superchip. You run the full pipeline on a single system: custom BPE tokenizer training, base model pretraining, midtraining (conversation format), supervised fine-tuning (SFT), and inference via CLI or web UI.
This playbook demonstrates training of [nanochat](https://github.com/karpathy/nanochat) on DGX Station with the GB300 Ultra Superchip. You run the full pipeline on a single system: custom BPE tokenizer training, base model pretraining, supervised fine-tuning (SFT), and inference via CLI or web UI.
The project uses the PyTorch NGC container, FineWeb for pretraining, SmolTalk for SFT, and Weights & Biases for logging. The default speedrun configuration trains a 561M-parameter (d20) model suitable for learning and experimentation.
The project uses the PyTorch NGC container, FineWeb for pretraining data, SmolTalk for SFT, and Weights & Biases for logging. The default configuration trains a ~1B parameter (d24) Transformer model with FP8 precision.
## What you'll accomplish
You will have a working nanochat setup that trains a small LLM and serves it for chat.
- **Environment:** Docker image with PyTorch and nanochat dependencies on your DGX Station.
- **Training pipeline:** Tokenizer (65K BPE), pretraining (~11.2B tokens), midtraining, SFT, and automated report generation.
- **Inference:** ChatGPT-style web UI and CLI to chat with the base, mid, or SFT checkpoints.
- **Monitoring:** W&B dashboards and `nanochat/report.md` with metrics and samples.
- **Environment:** Docker image with PyTorch NGC and nanochat dependencies on your DGX Station.
- **Training pipeline:** BPE tokenizer (65K vocab), base model pretraining with FP8, SFT, and automated report generation.
- **Inference:** ChatGPT-style web UI and CLI to chat with the base or SFT checkpoints.
- **Monitoring:** W&B dashboards and `nanochat_cache/report/report.md` with metrics and samples.
## What to know before starting
@@ -38,104 +36,143 @@ You will have a working nanochat setup that trains a small LLM and serves it for
**Hardware:**
- NVIDIA DGX Station with GB300 Ultra Superchip.
- Sufficient GPU memory for the chosen model (the GB300 Ultra provides ample memory for the d20 speedrun).
- Adequate storage for cache (~24GB+ for FineWeb data and checkpoints).
- NVIDIA DGX Station with GB300 Ultra Superchip (288GB VRAM).
- Adequate storage for cache (~25GB+ for FineWeb data and checkpoints).
**Software:**
- Docker with NVIDIA Container Toolkit: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.01-py3 nvidia-smi`
- Network access to download datasets (Hugging Face, FineWeb) and container images.
- Docker with NVIDIA Container Toolkit: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi`
- Network access to download datasets (Hugging Face, FineWeb) and container images (nvcr.io).
- [Weights & Biases](https://wandb.ai/) account and API key.
- [Hugging Face](https://huggingface.co/docs/hub/en/security-tokens) token for evaluation datasets.
## Model architecture (d24)
```
Layers: 24
Attention Heads: 12
Head Dimension: 128
Context Length: 2048 tokens
Vocabulary Size: 65,536 (2^16, trained BPE)
Precision: FP8 (e4m3, tensorwise scaling)
```
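As a rough sanity check on the "~1B parameter" figure, the numbers above can be plugged into a standard Transformer parameter estimate. This is a back-of-the-envelope sketch, not nanochat's own accounting: it assumes the model width equals heads × head dimension, roughly 12·d² weights per layer (attention plus a 4× MLP), and separate input and output embedding matrices. Any extra parameters in the actual architecture are not counted.

```bash
# Rough parameter estimate for the d24 configuration (assumptions in comments).
layers=24
heads=12
head_dim=128
vocab=65536

# Assumption: model width = heads * head_dim
width=$(( heads * head_dim ))

# Assumption: ~12 * width^2 weights per layer (4*d^2 attention + 8*d^2 MLP)
per_layer=$(( 12 * width * width ))

# Assumption: untied input and output embeddings, each vocab * width
embeddings=$(( 2 * vocab * width ))

total=$(( layers * per_layer + embeddings ))
echo "width=$width total_params=$total"
```

Under these assumptions the estimate lands a little under 0.9B, in the same range as the ~1B quoted for d24.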
## Training stages
| Stage | Description |
|-------|-------------|
| Tokenizer | Trains BPE tokenizer (65K vocab) on ~2B characters from FineWeb |
| Base pretraining | Pretrains d24 model on FineWeb with FP8, target data:param ratio of 8 |
| SFT | Fine-tunes on synthetic identity conversations + SmolTalk |
| Report | Generates `report.md` with metrics, samples, and system info |
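The "data:param ratio of 8" in the table translates into a token budget, which in turn determines how many FineWeb shards pretraining needs. The sketch below works through that arithmetic. It assumes a round 1B parameters, about 4.8 characters per token, and about 250M characters per shard; the last two figures are approximations taken from comments in the speedrun scripts, and the real trainer may count parameters differently.

```bash
# Estimate pretraining tokens and FineWeb shards (all inputs are approximations).
params=1000000000          # assumption: ~1B parameters
ratio=8                    # target data:param ratio from the table above
chars_per_token_x10=48     # assumption: ~4.8 chars/token, scaled by 10 for integer math
chars_per_shard=250000000  # ~250M characters per shard

tokens=$(( params * ratio ))
chars=$(( tokens * chars_per_token_x10 / 10 ))

# Ceiling division: round up to whole shards
shards=$(( (chars + chars_per_shard - 1) / chars_per_shard ))
echo "tokens=$tokens chars=$chars shards=$shards"
```

With these inputs the estimate comes out in the neighborhood of 150 shards, which is consistent with the script's comment that roughly 150 shards are needed before padding to 170.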
## Ancillary files
All required assets are in the playbook directory `nvidia/station-nanochat/assets` (see the [dgx-spark-playbooks](https://github.com/NVIDIA/dgx-spark-playbooks) repository).
- `assets/Dockerfile` PyTorch NGC image plus nanochat dependencies and venv.
- `assets/setup.sh` Clones nanochat, checks out the supported commit, and builds the Docker image.
- `assets/launch.sh` Runs the training container on your DGX Station (runs the full pipeline: tokenizer, pretrain, midtrain, SFT, and report generation).
- `assets/README.md` Additional detail on training stages, inference, and troubleshooting.
All required assets are in `nvidia/station-nanochat/assets/`:
- `Dockerfile` PyTorch NGC image with nanochat pip dependencies.
- `setup.sh` Clones nanochat, checks out the supported commit, copies `speedrun_station.sh`, and builds the Docker image.
- `launch.sh` Runs the training container (full pipeline: tokenizer → pretrain → SFT → report).
- `speedrun_station.sh` Modified speedrun script adapted for single-GPU DGX Station.
## Time & risk
- **Estimated time:** About 30 minutes for clone, setup, and launching the run. Full d20 speedrun training time depends on your DGX Station configuration (hours to a day or more).
- **Risk level:** Medium
- Large downloads (FineWeb) can fail or be slow; ensure stable network and disk space.
- API keys (W&B, HF) must be set or the launch script will exit.
- **Rollback:** Stop containers with `docker stop`, remove caches under `~/.cache/nanochat` (or paths in `launch.sh`), and run `docker system prune -a` if needed.
* **Last Updated:** 03/02/2026
* First Publication
- **Estimated time:** ~30 minutes for setup. Full d24 training takes on the order of 16+ hours on a single GB300 Ultra.
- **Risk level:** Medium
- Large downloads (FineWeb) can be slow; ensure stable network and disk space.
- API keys (W&B, HF) must be set or `launch.sh` will exit immediately.
- **Rollback:** Stop containers with `docker stop`, remove caches, and run `docker system prune -a` if needed.
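Because the run needs tens of gigabytes of cache, it can help to check free disk space before launching. The following is a minimal sketch, assuming the cache lives under the current directory and using the ~25GB figure from the prerequisites as the threshold; the variable names are illustrative and not part of the playbook's scripts.

```bash
# Pre-flight check: is there enough free space for the nanochat cache?
required_gb=25       # threshold taken from the playbook's storage estimate
target_dir="."       # assumption: cache directories are created under the current directory

# df -Pk prints POSIX-format output in 1 KiB blocks; column 4 of row 2 is available space
free_kb=$(df -Pk "$target_dir" | awk 'NR==2 {print $4}')
free_gb=$(( free_kb / 1024 / 1024 ))

if [ "$free_gb" -ge "$required_gb" ]; then
  space_ok=yes
else
  space_ok=no
fi
echo "free=${free_gb}GB required=${required_gb}GB ok=$space_ok"
```

Checkpoints add to the FineWeb download, so treating 25GB as a lower bound rather than a target is the safer reading.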
## Credits
- [nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy
- [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) by HuggingFace (pretraining data)
- [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) by HuggingFace (SFT data)
## Instructions
## Step 1. Prerequisites and environment
This playbook is for **DGX Station** (single node). Ensure your DGX Station has Docker with NVIDIA runtime, GPU access, and required API keys. Nanochat uses Weights & Biases (W&B) for training visualization and a Hugging Face token for evaluation datasets.
Ensure your DGX Station has Docker with NVIDIA runtime and GPU access. Nanochat uses Weights & Biases (W&B) for training visualization and a Hugging Face token for evaluation datasets.
```bash
## Verify GPU and Docker
nvidia-smi
docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.01-py3 nvidia-smi
docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi
```
Expected output should show your GPU(s) and driver version. Create a [W&B account](https://wandb.ai/) and a [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens) if you do not have them.
Create a [W&B account](https://wandb.ai/) and a [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens) if you don't have them. Export both keys in your shell:
```bash
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
export HF_TOKEN=<YOUR_HF_TOKEN>
```
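Since `launch.sh` exits if either key is missing, a quick check in the same shell can catch the problem early. This is a small sketch of such a check, not part of the playbook's scripts; the `missing` variable is purely illustrative.

```bash
# Report which required keys are not set in the current shell.
missing=""

if [ -z "$WANDB_API_KEY" ]; then
  missing="$missing WANDB_API_KEY"
fi

if [ -z "$HF_TOKEN" ]; then
  missing="$missing HF_TOKEN"
fi

if [ -n "$missing" ]; then
  echo "Not set:$missing"
else
  echo "Both keys are set."
fi
```

Run it in the same terminal session you will use for `./launch.sh`, because exported variables do not carry over to other shells.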
## Step 2. Clone the playbook and set up nanochat
## Step 2. Clone and set up
Clone the playbook repository and run the setup script to clone the nanochat repo and build the Docker image.
Clone the playbook repository and navigate to the assets directory:
```bash
git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-spark-playbooks/nvidia/station-nanochat/assets
```
From the `assets` directory, run the setup script. It clones [nanochat](https://github.com/karpathy/nanochat), checks out the supported commit, and builds the `nanochat` Docker image (PyTorch NGC base with tiktoken, tokenizers, datasets, wandb, etc.).
Run the setup script. It clones [nanochat](https://github.com/karpathy/nanochat), checks out the supported commit, copies the station-adapted `speedrun_station.sh`, and builds the `nanochat` Docker image (PyTorch NGC base with dependencies):
```bash
./setup.sh
```
Setup may take several minutes while the image builds. Verify the image:
You should see the `nanochat` image listed if you run `docker images`. Your directory structure after setup should look like this:
```bash
docker images | grep nanochat
```
assets/
├── Dockerfile
├── launch.sh
├── setup.sh
├── speedrun_station.sh
└── nanochat/
```
You should see the `nanochat` image listed.
## Step 3. Launch training
## Step 3. Launch full training
> [!NOTE]
> The default `launch.sh` uses cache directories under `/nanochat_cache`. If that path does not exist on your DGX Station, edit `launch.sh` and replace those paths with your own (e.g. `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`), and create the directories before running.
To run **full** training (d20 model, 240 shards, midtraining, SFT, report) for higher-quality results, use the full launcher. On a DGX Station with GB300 Ultra this can take on the order of 16 hours:
Ensure your API keys are exported, then launch:
```bash
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
export HF_TOKEN=<YOUR_HF_TOKEN>
./launch_full.sh
./launch.sh
```
This runs `speedrun_full.sh` inside the container: full FineWeb download (240 shards), 561M-parameter (d20) pretraining, midtraining, supervised fine-tuning, and report generation.
The training runs inside the `nanochat` container and executes the full pipeline automatically:
## Step 4. Verify and use the model
1. **Tokenizer training** — downloads ~2B characters from FineWeb, trains a 65K BPE tokenizer
2. **Base model pretraining** — downloads additional FineWeb shards, pretrains a d24 model (~1B params) with FP8
3. **SFT** — downloads synthetic identity conversations, fine-tunes for chat
4. **Report generation** — produces `report.md` with metrics and samples
After training completes, checkpoints and the tokenizer are under `~/.cache/nanochat/` (or the cache path used in `launch.sh`). Run inference from the nanochat directory (e.g. `assets/nanochat`) on your DGX Station.
Training on a single GB300 Ultra takes on the order of 16+ hours for the full d24 run.
## Step 4. Monitor training
**W&B dashboard:**
Track training at [wandb.ai](https://wandb.ai/) under the `nanochat` project. The exact link to the W&B run is printed in the training logs. Key metrics:
- Training loss
- Validation BPB
- Throughput (tokens/sec)
## Step 5. Inference
After training, checkpoints are saved under the `nanochat_cache/` directory. Run inference from inside the container or interactively:
**Web UI (recommended):**
```bash
cd nanochat
source ../.venv/bin/activate # if using venv from container context; otherwise use the container
python -m scripts.chat_web
docker run --rm --gpus all --net=host \
-v $(pwd)/nanochat:/workspace/nanochat \
-v $(pwd)/nanochat_cache:/root/.cache/nanochat \
-w /workspace/nanochat \
nanochat \
python -m scripts.chat_web
```
Open a browser to `http://<STATION_IP>:8000` where `<STATION_IP>` is your DGX Station's IP address.
@@ -143,14 +180,15 @@ Open a browser to `http://<STATION_IP>:8000` where `<STATION_IP>` is your DGX St
**CLI:**
```bash
cd nanochat
python -m scripts.chat_cli -p "Why is the sky blue?"
python -m scripts.chat_cli -i sft -p "Write a haiku about machine learning"
docker run --rm -it --gpus all \
-v $(pwd)/nanochat:/workspace/nanochat \
-v $(pwd)/nanochat_cache:/root/.cache/nanochat \
-w /workspace/nanochat \
nanochat \
python -m scripts.chat_cli -p "Why is the sky blue?"
```
A full report is generated at `nanochat/report.md` after the run. You can also monitor training at [wandb.ai](https://wandb.ai/) under your project.
## Step 5. Cleanup
## Step 6. Cleanup
To stop training early, interrupt the launch script or stop the container:
@@ -160,32 +198,43 @@ To stop training early, interrupt the launch script or stop the container:
```bash
## If launch.sh is running: press Ctrl+C
## Or stop the container by name
## Or stop the container directly
docker stop $(docker ps -q --filter ancestor=nanochat)
```
To free disk space after training (use the same path as your cache if you set `NANOCHAT_CACHE`):
To free disk space:
```bash
rm -rf ./nanochat_cache ./hf_cache
docker system prune -a
```
## Step 6. Next steps and customization
## Step 7. Customization
- **Small scale run:** To run a lighter training job, edit `speedrun_station.sh` as described in the customization guide before running `./launch.sh`. This can reduce the training time.
- **Custom cache paths:** Set `NANOCHAT_CACHE` and `HF_CACHE` before launching (e.g. `export NANOCHAT_CACHE=/path/to/nanochat_cache`) if you want cache outside the assets directory.
- **Monitoring:** Use `nvidia-smi` and W&B dashboards to watch GPU utilization and training metrics (loss, throughput).
- **Inference:** Try the web UI and CLI with different checkpoints (`base`, `mid`, `sft`) and prompts; see sample prompts in `assets/README.md`.
**Smaller/faster run:** Edit `speedrun_station.sh` before running setup to reduce data and model size:
```bash
## Fewer data shards (10 instead of default)
python -m nanochat.dataset -n 10 &
## Smaller model (d4 instead of d24), smaller batch size
python -m scripts.base_train --depth=4 --device-batch-size=32
```
**Batch size:** The default `--device-batch-size=64` is tuned for the GB300's 288GB VRAM. Lower it if training runs out of memory (OOM), or raise it if GPU utilization is low.
Then re-run `./setup.sh` to rebuild with the changes.
## Troubleshooting
| Symptom | Cause | Fix |
|--------|--------|-----|
| `WANDB_API_KEY is not set` or `HF_TOKEN is not set` | Required env vars not exported before `launch.sh` | `export WANDB_API_KEY=<your_key>` and `export HF_TOKEN=<your_token>` in the same shell, then run `./launch.sh`. |
| `RuntimeError: CUDA out of memory` | Batch size or model too large for GPU | In the training script in the cloned nanochat repo (e.g. `speedrun.sh`), reduce `--device_batch_size` (e.g. `16`, `8`, `4`, `2`, or `1`). |
| Docker container not starting or no GPU | Docker or NVIDIA runtime misconfigured | Run `nvidia-smi` on your DGX Station. Check no other containers hold GPUs: `docker ps`. Test GPU in Docker: `docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi`. |
| `Permission denied` or `No such file or directory` for cache paths in `launch.sh` | Paths like `/home/scratch.lramesh_dpt/...` don't exist on your system | Edit `launch.sh`: set cache dirs to paths you can create (e.g. `$(pwd)/nanochat_cache`, `$(pwd)/hf_cache`). Run `mkdir -p <your_cache_dirs>` and re-run `launch.sh`. |
| `nanochat` image not found when running `launch.sh` | Setup not run or build failed | From `nvidia/station-nanochat/assets`, run `./setup.sh` and confirm with `docker images` (look for the `nanochat` image). |
| Training exits immediately or script doesn't wait | Container fails early (missing keys, paths, or OOM) | Check container logs: `docker ps -a` then `docker logs <container_id>`. Fix env vars, cache paths, or batch size as above. |
| Wrong cache path or "No such file" when launching | `launch.sh` uses non-existent paths (e.g. `/home/scratch...`) | On DGX Station, edit `launch.sh`: replace cache dirs with `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`, then run `mkdir -p nanochat_cache hf_cache`. |
|---------|-------|-----|
| `WANDB_API_KEY is not set` or `HF_TOKEN is not set` | Required env vars not exported before `launch.sh` | `export WANDB_API_KEY=<key>` and `export HF_TOKEN=<token>` in the same shell, then re-run `./launch.sh` |
| `RuntimeError: CUDA out of memory` | Batch size too large for available VRAM | Edit `speedrun_station.sh`: reduce `--device-batch-size` below the default of 64 (try 32, 16, or 8). Re-run `./setup.sh`, then `./launch.sh` |
| Docker container exits immediately | Missing env vars, bad cache paths, or build failure | Check logs: `docker ps -a` then `docker logs <container_id>`. Fix env vars or paths as needed |
| `nanochat` image not found | Setup not run or Docker build failed | From the `assets/` directory, run `./setup.sh` and confirm with `docker images \| grep nanochat` |
| `No such file or directory` for cache paths | Cache directories don't exist | `launch.sh` creates them automatically under `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`. If using custom paths, create them: `mkdir -p $NANOCHAT_CACHE $HF_CACHE` |
| Training hangs at "Waiting for dataset download" | Network issue downloading FineWeb shards | Check network connectivity. The download can take time depending on bandwidth. If it persists, restart `./launch.sh` |
| W&B shows wrong user / stale login | Cached W&B credentials in container volume | `speedrun_station.sh` runs `wandb login --relogin` with your key automatically. Ensure `WANDB_API_KEY` is correct |
| Container runs but `launch.sh` says "Training complete!" immediately | Container failed fast and exited, so `launch.sh` stopped waiting and printed the completion message | Check `docker ps -a` for the exited container and inspect logs with `docker logs <id>` |
| GPU not visible inside container | Docker NVIDIA runtime not configured | Test: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi`. If it fails, install/configure [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) |


@@ -1,11 +1,15 @@
FROM nvcr.io/nvidia/pytorch:25.09-py3
FROM nvcr.io/nvidia/pytorch:26.04-py3
WORKDIR /workspace
# Install dependencies globally so torchrun (which uses /usr/bin/python) can access them
RUN /usr/bin/python -m pip install tiktoken tokenizers datasets psutil files-to-prompt regex setuptools uvicorn wandb maturin
# Create venv with --system-site-packages so it inherits global packages
RUN /usr/bin/python -m venv --system-site-packages .venv
RUN pip install \
datasets \
tokenizers \
wandb \
tiktoken \
psutil \
files-to-prompt \
uvicorn \
rustbpe
CMD ["/bin/bash"]


@@ -3,7 +3,6 @@
# SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Lite training (default). Runs speedrun.sh, which setup copies from speedrun_lite.sh.
# Get wandb API key
export WANDB_API_KEY=$WANDB_API_KEY
@@ -11,7 +10,6 @@ if [ -z "$WANDB_API_KEY" ]; then
echo "WANDB_API_KEY is not set"
exit 1
fi
export WANDB_RUN=${WANDB_RUN:-speedrun}
# Get Hugging Face API key
@@ -21,26 +19,23 @@ if [ -z "$HF_TOKEN" ]; then
exit 1
fi
# Cleanup function to stop containers
# Use local cache dirs so no root paths are required
workdir=$(pwd)
NANOCHAT_CACHE="$(pwd)/nanochat_cache"
HF_CACHE="$(pwd)/hf_cache"
cleanup() {
echo
echo "Stopping containers..."
docker stop $(docker ps -q --filter ancestor=nanochat) 2>/dev/null || true
echo "Interrupted training!"
echo -e "\nStopping training container..."
docker stop $(docker ps -q --filter ancestor=nanochat) 2>/dev/null
echo "Cleanup complete."
exit 0
}
workdir=$(pwd)
# DGX Station: use local cache dirs so no root paths are required
NANOCHAT_CACHE="${NANOCHAT_CACHE:-$(pwd)/nanochat_cache}"
HF_CACHE="${HF_CACHE:-$(pwd)/hf_cache}"
mkdir -p "$NANOCHAT_CACHE" "$HF_CACHE"
trap cleanup SIGINT SIGTERM
cmd="
mkdir -p /nanochat_cache && \
mkdir -p /hf_cache && \
chmod 777 /nanochat_cache && \
chmod 777 /hf_cache && \
# Launch Nanochat training
cmd="mkdir -p $NANOCHAT_CACHE $HF_CACHE && \
chmod u+rwx $NANOCHAT_CACHE $HF_CACHE && \
docker run \
--rm \
--runtime=nvidia \
@@ -57,16 +52,8 @@ docker run \
-v $HF_CACHE:/root/.cache/huggingface \
-w /workspace/nanochat \
nanochat \
bash speedrun.sh"
bash runs/speedrun.sh"
sh -c "$cmd" &
sleep 5
while true; do
if ! docker ps | grep -q "nanochat"; then
echo
echo "Training complete!"
exit 0
fi
sleep 1
done
wait
echo -e "\nTraining complete!"


@@ -11,10 +11,10 @@ assets_dir="$(cd "$(dirname "$0")" && pwd)"
cmd="cd $workdir && \
git clone https://github.com/karpathy/nanochat.git && \
cd nanochat && \
git checkout c6b7ab744055d5915e6ccb61088de80c10cbaff9 && \
cp ../speedrun_spark.sh ./speedrun.sh && \
git checkout 0aaca56805eb13f6e6e1fff789a08086902f12ab && \
cp ../speedrun_station.sh ./runs/speedrun.sh && \
cd .. && \
chmod +x launch_full.sh 2>/dev/null || true && \
chmod +x launch.sh 2>/dev/null || true && \
docker build -t nanochat ."
sh -c "$cmd"


@@ -1,15 +1,14 @@
#!/bin/bash
set -e
# This script is the "Best ChatGPT clone that $100 can buy",
# It is designed to run in ~4 hours on 8XH100 node at $3/GPU/hour.
# This script is configured to train your own GPT-2 grade LLM (pretraining + finetuning)
# It is designed to run on a blank 8XH100 GPU node and takes approximately 3 hours to complete.
# 1) Example launch (simplest):
# bash speedrun.sh
# 2) Example launch in a screen session (because the run takes ~4 hours):
# screen -L -Logfile speedrun.log -S speedrun bash speedrun.sh
# bash runs/speedrun.sh
# 2) Example launch in a screen session (because the run takes ~3 hours):
# screen -L -Logfile runs/speedrun.log -S speedrun bash runs/speedrun.sh
# 3) Example launch with wandb logging, but see below for setting up wandb first:
# WANDB_RUN=speedrun screen -L -Logfile speedrun.log -S speedrun bash speedrun.sh
# WANDB_RUN=speedrun screen -L -Logfile runs/speedrun.log -S speedrun bash runs/speedrun.sh
# Default intermediate artifacts directory is in ~/.cache/nanochat
export OMP_NUM_THREADS=1
@@ -26,7 +25,7 @@ mkdir -p $NANOCHAT_BASE_DIR
# install the repo dependencies
# uv sync --extra gpu
# activate venv so that `python` uses the project's venv instead of system python
source ../.venv/bin/activate
# source .venv/bin/activate
# -----------------------------------------------------------------------------
# wandb setup
@@ -49,70 +48,41 @@ python -m nanochat.report reset
# -----------------------------------------------------------------------------
# Tokenizer
# Install Rust / Cargo
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source "$HOME/.cargo/env"
# Build the rustbpe Tokenizer
# unset VIRTUAL_ENV
maturin develop --release --manifest-path rustbpe/Cargo.toml
# Download the first ~2B characters of pretraining dataset
# look at dev/repackage_data_reference.py for details on how this data was prepared
# each data shard is ~250M chars
# so we download 2e9 / 250e6 = 8 data shards at this point
# each shard is ~100MB of text (compressed), so this is about ~800MB of data on disk
# look at dev/repackage_data_reference.py for details on how this data was prepared
python -m nanochat.dataset -n 8
# Immediately also kick off downloading more shards in the background while tokenizer trains
# See comment below for why 240 is the right number here
python -m nanochat.dataset -n 240 &
# Approximately 150 shards are needed for GPT-2 capability pretraining, add 20 for padding.
# The maximum total number of shards available in the entire dataset is 6542.
python -m nanochat.dataset -n 170 &
DATASET_DOWNLOAD_PID=$!
# train the tokenizer with vocab size 2**16 = 65536 on ~2B characters of data
python -m scripts.tok_train --max_chars=2000000000
# train the tokenizer with vocab size 2**15 = 32768 on ~2B characters of data
python -m scripts.tok_train
# evaluate the tokenizer (report compression ratio etc.)
python -m scripts.tok_eval
# -----------------------------------------------------------------------------
# Base model (pretraining)
# The d20 model is 561M parameters.
# Chinchilla says #tokens = 20X #params, so we need 561e6 * 20 = 11.2B tokens.
# Assume our tokenizer is 4.8 chars/token, this is 11.2B * 4.8 ~= 54B chars.
# At 250M chars/shard, this is 54B / 250M ~= 216 shards needed for pretraining.
# Round up to 240 for safety. At ~100MB/shard, this downloads ~24GB of data to disk.
# (The total number of shards available in the entire dataset is 1822.)
echo "Waiting for dataset download to complete..."
wait $DATASET_DOWNLOAD_PID
source ../.venv/bin/activate
# pretrain the d20 model
python -m scripts.base_train --depth=20 --run=$WANDB_RUN
# evaluate the model on a larger chunk of train/val data and draw some samples
python -m scripts.base_loss
# evaluate the model on CORE tasks
python -m scripts.base_eval
sleep 5
# d24 model (slightly undertrained to beat GPT-2 => decrease data:params ratio from compute optimal 10.5 (default) to 8)
python -m scripts.base_train --depth=24 --target-param-data-ratio=8 --device-batch-size=64 --fp8 --run=$WANDB_RUN
# evaluate the model: CORE metric, BPB on train/val, and draw samples
python -m scripts.base_eval --device-batch-size=64
# -----------------------------------------------------------------------------
# Midtraining (teach the model conversation special tokens, tool use, multiple choice)
# SFT (teach the model conversation special tokens, tool use, multiple choice)
# download 2.3MB of synthetic identity conversations to impart a personality to nanochat
# see dev/gen_sft_data.py for details on how this data was prepared and to get a sense of how you can easily tune it
# see dev/gen_synthetic_data.py for details on how this data was prepared and to get a sense of how you can easily tune it
curl -L -o $NANOCHAT_BASE_DIR/identity_conversations.jsonl https://karpathy-public.s3.us-west-2.amazonaws.com/identity_conversations.jsonl
# run midtraining and eval the model
python -m scripts.mid_train --run=$WANDB_RUN
python -m scripts.chat_eval -i mid
sleep 5
# -----------------------------------------------------------------------------
# Supervised Finetuning (domain adaptation to each sequence all by itself per row)
# train sft and re-eval right away (should see a small bump)
python -m scripts.chat_sft --run=$WANDB_RUN
# run SFT and eval the model
python -m scripts.chat_sft --device-batch-size=64 --run=$WANDB_RUN
python -m scripts.chat_eval -i sft
# chat with the model over CLI! Leave out the -p to chat interactively
@@ -121,15 +91,6 @@ python -m scripts.chat_eval -i sft
# even better, chat with your model over a pretty WebUI ChatGPT style
# python -m scripts.chat_web
# -----------------------------------------------------------------------------
# Reinforcement Learning. Optional, and currently only on GSM8K
# (optional)
# run reinforcement learning
# python -m scripts.chat_rl --run=$WANDB_RUN
# eval the RL model only on GSM8K
# python -m scripts.chat_eval --i rl -a GSM8K
# -----------------------------------------------------------------------------
# Generate the full report by putting together all the sections
# report.md is the output and will be copied to current directory for convenience


@@ -45,18 +45,16 @@ spec:
content: |
# Basic idea
This playbook demonstrates training of [nanochat](https://github.com/karpathy/nanochat) on DGX Station with the GB300 Ultra Superchip. You run the full pipeline on a single system: custom BPE tokenizer training, base model pretraining, midtraining (conversation format), supervised fine-tuning (SFT), and inference via CLI or web UI.
This playbook demonstrates training of [nanochat](https://github.com/karpathy/nanochat) on DGX Station with the GB300 Ultra Superchip. You run the full pipeline on a single system: custom BPE tokenizer training, base model pretraining, supervised fine-tuning (SFT), and inference via CLI or web UI.
The project uses the PyTorch NGC container, FineWeb for pretraining, SmolTalk for SFT, and Weights & Biases for logging. The default speedrun configuration trains a 561M-parameter (d20) model suitable for learning and experimentation.
The project uses the PyTorch NGC container, FineWeb for pretraining data, SmolTalk for SFT, and Weights & Biases for logging. The default configuration trains a ~1B parameter (d24) Transformer model with FP8 precision.
# What you'll accomplish
You will have a working nanochat setup that trains a small LLM and serves it for chat.
- **Environment:** Docker image with PyTorch and nanochat dependencies on your DGX Station.
- **Training pipeline:** Tokenizer (65K BPE), pretraining (~11.2B tokens), midtraining, SFT, and automated report generation.
- **Inference:** ChatGPT-style web UI and CLI to chat with the base, mid, or SFT checkpoints.
- **Monitoring:** W&B dashboards and `nanochat/report.md` with metrics and samples.
- **Environment:** Docker image with PyTorch NGC and nanochat dependencies on your DGX Station.
- **Training pipeline:** BPE tokenizer (65K vocab), base model pretraining with FP8, SFT, and automated report generation.
- **Inference:** ChatGPT-style web UI and CLI to chat with the base or SFT checkpoints.
- **Monitoring:** W&B dashboards and `nanochat_cache/report/report.md` with metrics and samples.
# What to know before starting
@@ -68,36 +66,58 @@ spec:
**Hardware:**
- NVIDIA DGX Station with GB300 Ultra Superchip.
- Sufficient GPU memory for the chosen model (the GB300 Ultra provides ample memory for the d20 speedrun).
- Adequate storage for cache (~24GB+ for FineWeb data and checkpoints).
- NVIDIA DGX Station with GB300 Ultra Superchip (288GB VRAM).
- Adequate storage for cache (~25GB+ for FineWeb data and checkpoints).
**Software:**
- Docker with NVIDIA Container Toolkit: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.01-py3 nvidia-smi`
- Network access to download datasets (Hugging Face, FineWeb) and container images.
- Docker with NVIDIA Container Toolkit: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi`
- Network access to download datasets (Hugging Face, FineWeb) and container images (nvcr.io).
- [Weights & Biases](https://wandb.ai/) account and API key.
- [Hugging Face](https://huggingface.co/docs/hub/en/security-tokens) token for evaluation datasets.
# Model architecture (d24)
```
Layers: 24
Attention Heads: 12
Head Dimension: 128
Context Length: 2048 tokens
Vocabulary Size: 65,536 (2^16, trained BPE)
Precision: FP8 (e4m3, tensorwise scaling)
```
# Training stages
| Stage | Description |
|-------|-------------|
| Tokenizer | Trains BPE tokenizer (65K vocab) on ~2B characters from FineWeb |
| Base pretraining | Pretrains d24 model on FineWeb with FP8, target data:param ratio of 8 |
| SFT | Fine-tunes on synthetic identity conversations + SmolTalk |
| Report | Generates `report.md` with metrics, samples, and system info |
# Ancillary files
All required assets are in the playbook directory `nvidia/station-nanochat/assets` (see the [dgx-spark-playbooks](https://github.com/NVIDIA/dgx-spark-playbooks) repository).
- `assets/Dockerfile` PyTorch NGC image plus nanochat dependencies and venv.
- `assets/setup.sh` Clones nanochat, checks out the supported commit, and builds the Docker image.
- `assets/launch.sh` Runs the training container on your DGX Station (runs the full pipeline: tokenizer, pretrain, midtrain, SFT, and report generation).
- `assets/README.md` Additional detail on training stages, inference, and troubleshooting.
All required assets are in `nvidia/station-nanochat/assets/`:
- `Dockerfile` PyTorch NGC image with nanochat pip dependencies.
- `setup.sh` Clones nanochat, checks out the supported commit, copies `speedrun_station.sh`, and builds the Docker image.
- `launch.sh` Runs the training container (full pipeline: tokenizer → pretrain → SFT → report).
- `speedrun_station.sh` Modified speedrun script adapted for single-GPU DGX Station.
# Time & risk
- **Estimated time:** About 30 minutes for clone, setup, and launching the run. Full d20 speedrun training time depends on your DGX Station configuration (hours to a day or more).
- **Risk level:** Medium
- Large downloads (FineWeb) can fail or be slow; ensure stable network and disk space.
- API keys (W&B, HF) must be set or the launch script will exit.
- **Rollback:** Stop containers with `docker stop`, remove caches under `~/.cache/nanochat` (or paths in `launch.sh`), and run `docker system prune -a` if needed.
* **Last Updated:** 03/02/2026
* First Publication
- **Estimated time:** ~30 minutes for setup. Full d24 training takes on the order of 16+ hours on a single GB300 Ultra.
- **Risk level:** Medium
- Large downloads (FineWeb) can be slow; ensure stable network and disk space.
- API keys (W&B, HF) must be set or `launch.sh` will exit immediately.
- **Rollback:** Stop containers with `docker stop`, remove caches, and run `docker system prune -a` if needed.
# Credits
- [nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy
- [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) by HuggingFace (pretraining data)
- [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) by HuggingFace (SFT data)
@ -108,69 +128,86 @@ spec:
content: |
# Step 1. Prerequisites and environment
This playbook is for **DGX Station** (single node). Ensure your DGX Station has Docker with NVIDIA runtime, GPU access, and required API keys. Nanochat uses Weights & Biases (W&B) for training visualization and a Hugging Face token for evaluation datasets.
Ensure your DGX Station has Docker with NVIDIA runtime and GPU access. Nanochat uses Weights & Biases (W&B) for training visualization and a Hugging Face token for evaluation datasets.
```bash
# Verify GPU and Docker
nvidia-smi
docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.01-py3 nvidia-smi
docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi
```
Expected output should show your GPU(s) and driver version. Create a [W&B account](https://wandb.ai/) and a [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens) if you do not have them.
Create a [W&B account](https://wandb.ai/) and a [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens) if you don't have them. Export both keys in your shell:
```bash
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
export HF_TOKEN=<YOUR_HF_TOKEN>
```
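Because `launch.sh` exits when either key is missing, a quick pre-flight check in the same shell can save a failed launch. The following is a minimal sketch; `check_keys` is a hypothetical helper and is not part of the playbook scripts.

```bash
# Hypothetical helper: report any missing key and return non-zero if one is unset or empty.
check_keys() {
  missing=0
  [ -n "${WANDB_API_KEY:-}" ] || { echo "WANDB_API_KEY is not set" >&2; missing=1; }
  [ -n "${HF_TOKEN:-}" ] || { echo "HF_TOKEN is not set" >&2; missing=1; }
  return "$missing"
}

if check_keys; then
  echo "Both keys are set"
fi
```

Run it in the shell you will launch from, since exported variables do not carry over to other terminal sessions.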
# Step 2. Clone the playbook and set up nanochat
# Step 2. Clone and set up
Clone the playbook repository and run the setup script to clone the nanochat repo and build the Docker image.
Clone the playbook repository and navigate to the assets directory:
```bash
git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-spark-playbooks/nvidia/station-nanochat/assets
```
From the `assets` directory, run the setup script. It clones [nanochat](https://github.com/karpathy/nanochat), checks out the supported commit, and builds the `nanochat` Docker image (PyTorch NGC base with tiktoken, tokenizers, datasets, wandb, etc.).
Run the setup script. It clones [nanochat](https://github.com/karpathy/nanochat), checks out the supported commit, copies the station-adapted `speedrun_station.sh`, and builds the `nanochat` Docker image (PyTorch NGC base with dependencies):
```bash
./setup.sh
```
Setup may take several minutes while the image builds. Verify the image:
You should see the `nanochat` image listed if you run `docker images`. Your directory structure after setup should look like this:
```bash
docker images | grep nanochat
```
assets/
├── Dockerfile
├── launch.sh
├── setup.sh
├── speedrun_station.sh
└── nanochat/
```
You should see the `nanochat` image listed.
# Step 3. Launch training
# Step 3. Launch full training
> [!NOTE]
> The default `launch.sh` uses cache directories under `/nanochat_cache`. If that path does not exist on your DGX Station, edit `launch.sh` and replace those paths with your own (e.g. `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`), and create the directories before running.
To run **full** training (d20 model, 240 shards, midtraining, SFT, report) for higher-quality results, use the full launcher. On a DGX Station with GB300 Ultra this can take on the order of 16 hours:
Ensure your API keys are exported, then launch:
```bash
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
export HF_TOKEN=<YOUR_HF_TOKEN>
./launch_full.sh
./launch.sh
```
This runs `speedrun_full.sh` inside the container: full FineWeb download (240 shards), 561M-parameter (d20) pretraining, midtraining, supervised fine-tuning, and report generation.
The training runs inside the `nanochat` container and executes the full pipeline automatically:
# Step 4. Verify and use the model
1. **Tokenizer training** — downloads ~2B characters from FineWeb, trains a 65K BPE tokenizer
2. **Base model pretraining** — downloads additional FineWeb shards, pretrains a d24 model (~1B params) with FP8
3. **SFT** — downloads synthetic identity conversations, fine-tunes for chat
4. **Report generation** — produces `report.md` with metrics and samples
After training completes, checkpoints and the tokenizer are under `~/.cache/nanochat/` (or the cache path used in `launch.sh`). Run inference from the nanochat directory (e.g. `assets/nanochat`) on your DGX Station.
Training on a single GB300 Ultra takes on the order of 16+ hours for the full d24 run.
# Step 4. Monitor training
**W&B dashboard:**
Track training at [wandb.ai](https://wandb.ai/) under the `nanochat` project. The exact link to the W&B run is printed in the training logs. Key metrics:
- Training loss
- Validation BPB
- Throughput (tokens/sec)
# Step 5. Inference
After training, checkpoints are saved under the `nanochat_cache/` directory. Run inference from inside the container or interactively:
**Web UI (recommended):**
```bash
cd nanochat
source ../.venv/bin/activate # if using venv from container context; otherwise use the container
python -m scripts.chat_web
docker run --rm --gpus all --net=host \
-v $(pwd)/nanochat:/workspace/nanochat \
-v $(pwd)/nanochat_cache:/root/.cache/nanochat \
-w /workspace/nanochat \
nanochat \
python -m scripts.chat_web
```
Open a browser to `http://<STATION_IP>:8000` where `<STATION_IP>` is your DGX Station's IP address.
@ -178,14 +215,15 @@ spec:
**CLI:**
```bash
cd nanochat
python -m scripts.chat_cli -p "Why is the sky blue?"
python -m scripts.chat_cli -i sft -p "Write a haiku about machine learning"
docker run --rm -it --gpus all \
-v $(pwd)/nanochat:/workspace/nanochat \
-v $(pwd)/nanochat_cache:/root/.cache/nanochat \
-w /workspace/nanochat \
nanochat \
python -m scripts.chat_cli -p "Why is the sky blue?"
```
A full report is generated at `nanochat/report.md` after the run. You can also monitor training at [wandb.ai](https://wandb.ai/) under your project.
# Step 5. Cleanup
# Step 6. Cleanup
To stop training early, interrupt the launch script or stop the container:
@ -195,23 +233,32 @@ spec:
```bash
# If launch.sh is running: press Ctrl+C
# Or stop the container by name
# Or stop the container directly
docker stop $(docker ps -q --filter ancestor=nanochat)
```
To free disk space after training (use the same path as your cache if you set `NANOCHAT_CACHE`):
To free disk space:
```bash
rm -rf ./nanochat_cache ./hf_cache
docker system prune -a
```
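Before deleting anything, it can help to see how much space the caches actually occupy. This is a small sketch; `cache_usage` is a hypothetical helper, and the two directory names are the ones used in the cleanup command above.

```bash
# Hypothetical helper: print disk usage for each directory that exists.
cache_usage() {
  for dir in "$@"; do
    if [ -d "$dir" ]; then
      du -sh "$dir"          # human-readable total for the directory
    else
      echo "skip: $dir (not found)"
    fi
  done
}

cache_usage ./nanochat_cache ./hf_cache
```

For the Docker side, `docker system df` gives a comparable summary of space used by images, containers, and build cache.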
# Step 6. Next steps and customization
# Step 7. Customization
- **Small scale run:** To run a lighter training job, edit `speedrun_station.sh` as described in the customization guide before running `./launch.sh`. This can reduce the training time.
- **Custom cache paths:** Set `NANOCHAT_CACHE` and `HF_CACHE` before launching (e.g. `export NANOCHAT_CACHE=/path/to/nanochat_cache`) if you want cache outside the assets directory.
- **Monitoring:** Use `nvidia-smi` and W&B dashboards to watch GPU utilization and training metrics (loss, throughput).
- **Inference:** Try the web UI and CLI with different checkpoints (`base`, `mid`, `sft`) and prompts; see sample prompts in `assets/README.md`.
**Smaller/faster run:** Edit `speedrun_station.sh` before running setup to reduce data and model size:
```bash
# Fewer data shards (10 instead of default)
python -m nanochat.dataset -n 10 &
# Smaller model (d4 instead of d24), smaller batch size
python -m scripts.base_train --depth=4 --device-batch-size=32
```
**Batch size:** The default `--device-batch-size=64` is tuned for the GB300's 288 GB of VRAM. Change the batch size if GPU utilization is low or if training runs out of memory (OOM).
Then re-run `./setup.sh` to rebuild with the changes.
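One way to apply a batch-size change without opening an editor is an in-place `sed` substitution. This sketch assumes the flag appears in `speedrun_station.sh` literally as `--device-batch-size=<number>`; check the script first, since its contents are not shown here. `set_batch_size` is a hypothetical helper.

```bash
# Hypothetical helper: rewrite --device-batch-size=<N> in a script file.
set_batch_size() {
  file=$1
  size=$2
  # -i.bak edits in place and keeps a backup copy next to the original
  sed -i.bak "s/--device-batch-size=[0-9][0-9]*/--device-batch-size=${size}/" "$file"
}

# Example (run from the assets directory):
# set_batch_size speedrun_station.sh 32
```

The `.bak` copy makes it easy to restore the original value if the smaller batch size is not needed.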
@ -221,14 +268,16 @@ spec:
label: Troubleshooting
content: |
| Symptom | Cause | Fix |
|--------|--------|-----|
| `WANDB_API_KEY is not set` or `HF_TOKEN is not set` | Required env vars not exported before `launch.sh` | `export WANDB_API_KEY=<your_key>` and `export HF_TOKEN=<your_token>` in the same shell, then run `./launch.sh`. |
| `RuntimeError: CUDA out of memory` | Batch size or model too large for GPU | In the training script in the cloned nanochat repo (e.g. `speedrun.sh`), reduce `--device_batch_size` (e.g. `16`, `8`, `4`, `2`, or `1`). |
| Docker container not starting or no GPU | Docker or NVIDIA runtime misconfigured | Run `nvidia-smi` on your DGX Station. Check no other containers hold GPUs: `docker ps`. Test GPU in Docker: `docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi`. |
| `Permission denied` or `No such file or directory` for cache paths in `launch.sh` | Paths like `/home/scratch.lramesh_dpt/...` don't exist on your system | Edit `launch.sh`: set cache dirs to paths you can create (e.g. `$(pwd)/nanochat_cache`, `$(pwd)/hf_cache`). Run `mkdir -p <your_cache_dirs>` and re-run `launch.sh`. |
| `nanochat` image not found when running `launch.sh` | Setup not run or build failed | From `nvidia/nanochat/assets`, run `./setup.sh` and confirm with `docker images` (look for the `nanochat` image). |
| Training exits immediately or script doesn't wait | Container fails early (missing keys, paths, or OOM) | Check container logs: `docker ps -a` then `docker logs <container_id>`. Fix env vars, cache paths, or batch size as above. |
| Wrong cache path or "No such file" when launching | `launch.sh` uses non-existent paths (e.g. `/home/scratch...`) | On DGX Station, edit `launch.sh`: replace cache dirs with `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`, then run `mkdir -p nanochat_cache hf_cache`. |
|---------|-------|-----|
| `WANDB_API_KEY is not set` or `HF_TOKEN is not set` | Required env vars not exported before `launch.sh` | `export WANDB_API_KEY=<key>` and `export HF_TOKEN=<token>` in the same shell, then re-run `./launch.sh` |
| `RuntimeError: CUDA out of memory` | Batch size too large for available VRAM | Edit `speedrun_station.sh`: reduce `--device-batch-size` below the default of 64 (try 32, 16, or 8). Re-run `./setup.sh` then `./launch.sh` |
| Docker container exits immediately | Missing env vars, bad cache paths, or build failure | Check logs: `docker ps -a` then `docker logs <container_id>`. Fix env vars or paths as needed |
| `nanochat` image not found | Setup not run or Docker build failed | From the `assets/` directory, run `./setup.sh` and confirm with `docker images \| grep nanochat` |
| `No such file or directory` for cache paths | Cache directories don't exist | `launch.sh` creates them automatically under `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`. If using custom paths, create them: `mkdir -p $NANOCHAT_CACHE $HF_CACHE` |
| Training hangs at "Waiting for dataset download" | Network issue downloading FineWeb shards | Check network connectivity. The download can take time depending on bandwidth. If it persists, restart `./launch.sh` |
| W&B shows wrong user / stale login | Cached W&B credentials in container volume | `speedrun_station.sh` runs `wandb login --relogin` with your key automatically. Ensure `WANDB_API_KEY` is correct |
| Container runs but `launch.sh` says "Training complete!" immediately | Container failed fast and exited before the poll loop detected it | Check `docker ps -a` for the exited container and inspect logs with `docker logs <id>` |
| GPU not visible inside container | Docker NVIDIA runtime not configured | Test: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi`. If it fails, install/configure [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) |


@ -87,7 +87,7 @@ spec:
- NVIDIA DGX Station with Blackwell architecture GPU (GB300 chip)
- Docker installed with GPU support
- NVIDIA Container Toolkit configured
- Megatron-Bridge installed (via the the NeMo Framework NGC container)
- Megatron-Bridge installed (via the NeMo Framework NGC container)
Verify your setup:
@ -139,7 +139,7 @@ spec:
nvcr.io/nvidia/nemo:${TAG}
```
All subsequent `torchrun` / `python` commands in this playbook are meant to be executed **from the shell inside this container** .
All subsequent `torchrun` / `python` commands in this playbook are meant to be executed **from the shell inside this container**.
# Step 2. Review the pretraining script
@ -279,7 +279,7 @@ spec:
| `CUDA out of memory` during model init | Insufficient GPU memory for Llama 3.1 8B + optimizer states | Reduce `micro_batch_size` or use `--nproc_per_node` for model parallelism |
| `torchrun` hangs or times out | NCCL communication failure between GPUs | Check `NCCL_DEBUG=INFO torchrun ...` for details; verify all GPUs are visible |
| Training loss is NaN | Precision instability | Increase `num_layers_at_end_in_bf16` (e.g., from 4 to 8) or reduce learning rate |
| `--no-fp4` works but NVFP4 crashes | Transformer Engine version mismatch | Ensure Transformer Engine supports NVFP4; update with `pip install --upgrade transformer-engine` |
| `--disable-fp4` works but NVFP4 crashes | Transformer Engine version mismatch | Ensure Transformer Engine supports NVFP4; update with `pip install --upgrade transformer-engine` |
| Slow training throughput | Not using Tensor Cores efficiently | Ensure batch dimensions are multiples of 8; check that `nvidia-smi` shows high GPU utilization |
| Permission denied on Docker | User not in docker group | Run `sudo usermod -aG docker $USER && newgrp docker` |


@ -330,7 +330,7 @@ spec:
```bash
git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-station-playbooks/nvidia/station-sglang-inference
cd dgx-spark-playbooks/nvidia/station-sglang-inference
```
> [!TIP]


@ -1,8 +1,8 @@
kind: Playbook
metadata:
name: station-vllm
displayName: Serve Qwen3-235B with vLLM
shortDescription: Set up vLLM server with Qwen3-235B on DGX Station
displayName: vLLM for Inference
shortDescription: Install and use vLLM on DGX Station
publisher: nvidia
description: |
# REPLACE THIS WITH YOUR MODEL CARD
@ -15,7 +15,7 @@ metadata:
attributes:
- key: DURATION
value: 20 MIN
value: 30 MIN
spec:
artifactName: station-vllm
@ -42,7 +42,9 @@ spec:
# What you'll accomplish
Serve the **Qwen3-235B-A22B-NVFP4** model using vLLM on NVIDIA DGX Station. This 235B parameter model uses NVFP4 quantization and fits entirely in VRAM on the GB300 GPU.
Serve a **supported model** using vLLM on NVIDIA DGX Station. Refer to the table below to see the supported models.
You'll set up high-throughput LLM serving with vLLM on NVIDIA DGX Station with Blackwell architecture.
# What to know before starting
@ -57,21 +59,30 @@ spec:
- HuggingFace account with access token
- Network access to NGC and HuggingFace
# Model Support Matrix
The following models are supported with vLLM on DGX Station. All listed models are available and ready to use:
| Model | Quantization | Support Status | HF Handle |
|-------|-------------|----------------|-----------|
| **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) |
| **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) |
| **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) |
# Time & risk
* **Duration:** 15-20 minutes (longer on first run due to model download)
* **Duration:** 30 minutes (longer on first run due to model download)
* **Risks:** Model download requires HuggingFace authentication
* **Rollback:** Stop and remove the container to restore state
* **Last Updated:** 03/02/2026
* First Publication
* **Last Updated:** 05/28/2026
* Update models
-
id: instructions
label: Serve Qwen3-235B
label: Instructions
content: |
# Step 1. Set up Docker permissions
@ -92,7 +103,7 @@ spec:
export HF_TOKEN="your_huggingface_token"
# Model to serve
export MODEL_HANDLE="nvidia/Qwen3-235B-A22B-NVFP4"
export MODEL_HANDLE="<HF_HANDLE>"
# Maximum context length
export MAX_MODEL_LEN=8192
@ -106,9 +117,16 @@ spec:
docker pull nvcr.io/nvidia/vllm:26.01-py3
```
For Step-3.7-Flash models, pull the custom vLLM container:
```bash
docker pull vllm/vllm-openai:stepfun37
```
# Step 4. Start vLLM server
Start the vLLM server with the Qwen3-235B model. This model fits entirely in VRAM on the GB300. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`.
Start the vLLM server with the model. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`.
For Qwen3-235B NVFP4 model, run with the NGC container. This model fits entirely in VRAM on the GB300.
```bash
docker run -d \
@ -126,6 +144,28 @@ spec:
--gpu-memory-utilization 0.9
```
For Step-3.7-Flash models, run with the custom vLLM container. Both the FP8 and NVFP4 versions fit entirely in VRAM on the GB300.
```bash
docker run -d \
--name vllm-server \
--gpus all \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 8000:8000 \
-e HF_TOKEN="$HF_TOKEN" \
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
vllm/vllm-openai:stepfun37 \
"$MODEL_HANDLE" \
--gpu-memory-utilization 0.95 \
--trust-remote-code \
--reasoning-parser step3p5 \
--enable-auto-tool-choice \
--tool-call-parser step3p5 \
--kv-cache-dtype fp8
```
Check the server logs for startup progress:
```bash
@ -135,7 +175,7 @@ spec:
Expected output includes:
- Model download progress (first run only)
- Model loading into GPU memory
- `Uvicorn running on http://0.0.0.0:8000`
- `Application startup complete.`
Press `Ctrl+C` to exit log view once the server is ready.
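As an alternative to watching the logs, you can poll the server until it answers. This is a sketch under two assumptions: the server is reachable on `localhost:8000` (matching the `-p 8000:8000` mapping), and it exposes the `/health` endpoint that vLLM's OpenAI-compatible server normally provides. `wait_for_server` is a hypothetical helper.

```bash
# Hypothetical helper: poll a URL until it returns HTTP success, or give up.
# Usage: wait_for_server URL [TRIES] [DELAY_SECONDS]
wait_for_server() {
  url=$1
  tries=${2:-30}
  delay=${3:-10}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f: treat HTTP errors as failure, -s: silent, --max-time: cap each attempt
    if curl -fs --max-time 5 -o /dev/null "$url"; then
      echo "Server is ready"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "Server did not become ready" >&2
  return 1
}

# Example:
# wait_for_server http://localhost:8000/health
```

With the defaults this waits up to about five minutes; on a first run the model download can take longer, so raise `TRIES` accordingly.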
@ -166,9 +206,10 @@ spec:
Optionally, remove the image and cached model:
For example:
```bash
docker rmi nvcr.io/nvidia/vllm:26.01-py3
rm -rf $HOME/.cache/huggingface/hub/models--nvidia--Qwen3-235B-A22B-NVFP4
docker rmi "<docker image name>"
rm -rf $HOME/.cache/huggingface/hub/"<downloaded model name>"
```


@ -1,8 +1,8 @@
kind: Playbook
metadata:
name: station-vllm
displayName: vLLM for Inference
shortDescription: Install and use vLLM on DGX Station
displayName: Serve Qwen3-235B with vLLM
shortDescription: Set up vLLM server with Qwen3-235B on DGX Station
publisher: nvidia
description: |
# REPLACE THIS WITH YOUR MODEL CARD
@ -15,7 +15,7 @@ metadata:
attributes:
- key: DURATION
value: 30 MIN
value: 20 MIN
spec:
artifactName: station-vllm
@ -42,9 +42,7 @@ spec:
# What you'll accomplish
Serve a **supported model** using vLLM on NVIDIA DGX Station. Refer to the table below to see the supported models.
You'll set up high-throughput LLM serving with vLLM on NVIDIA DGX Station with Blackwell architecture.
Serve the **Qwen3-235B-A22B-NVFP4** model using vLLM on NVIDIA DGX Station. This 235B parameter model uses NVFP4 quantization and fits entirely in VRAM on the GB300 GPU.
# What to know before starting
@ -59,30 +57,21 @@ spec:
- HuggingFace account with access token
- Network access to NGC and HuggingFace
# Model Support Matrix
The following models are supported with vLLM on Spark. All listed models are available and ready to use:
| Model | Quantization | Support Status | HF Handle |
|-------|-------------|----------------|-----------|
| **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) |
| **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) |
| **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) |
# Time & risk
* **Duration:** 30 minutes (longer on first run due to model download)
* **Duration:** 15-20 minutes (longer on first run due to model download)
* **Risks:** Model download requires HuggingFace authentication
* **Rollback:** Stop and remove the container to restore state
* **Last Updated:** 05/28/2026
* Update models
* **Last Updated:** 03/02/2026
* First Publication
-
id: instructions
label: Instructions
label: Serve Qwen3-235B
content: |
# Step 1. Set up Docker permissions
@ -103,7 +92,7 @@ spec:
export HF_TOKEN="your_huggingface_token"
# Model to serve
export MODEL_HANDLE="<HF_HANDLE>"
export MODEL_HANDLE="nvidia/Qwen3-235B-A22B-NVFP4"
# Maximum context length
export MAX_MODEL_LEN=8192
@ -117,16 +106,9 @@ spec:
docker pull nvcr.io/nvidia/vllm:26.01-py3
```
For Step-3.7-Flash models, pull the custom vLLM container:
```bash
docker pull vllm/vllm-openai:stepfun37
```
# Step 4. Start vLLM server
Start the vLLM server with the model. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`.
For Qwen3-235B NVFP4 model, run with the NGC container. This model fits entirely in VRAM on the GB300.
Start the vLLM server with the Qwen3-235B model. This model fits entirely in VRAM on the GB300. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`.
```bash
docker run -d \
@ -144,28 +126,6 @@ spec:
--gpu-memory-utilization 0.9
```
For Step-3.7-Flash models, run with the custom vLLM container. Both the FP8 and NVFP4 versions fit entirely in VRAM on the GB300.
```bash
docker run -d \
--name vllm-server \
--gpus all \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 8000:8000 \
-e HF_TOKEN="$HF_TOKEN" \
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
vllm/vllm-openai:stepfun37 \
"$MODEL_HANDLE" \
--gpu-memory-utilization 0.95 \
--trust-remote-code \
--reasoning-parser step3p5 \
--enable-auto-tool-choice \
--tool-call-parser step3p5 \
--kv-cache-dtype fp8
```
Check the server logs for startup progress:
```bash
@ -175,7 +135,7 @@ spec:
Expected output includes:
- Model download progress (first run only)
- Model loading into GPU memory
- `Application startup complete.`
- `Uvicorn running on http://0.0.0.0:8000`
Press `Ctrl+C` to exit log view once the server is ready.
@ -206,10 +166,9 @@ spec:
Optionally, remove the image and cached model:
For example:
```bash
docker rmi "<docker image name>"
rm -rf $HOME/.cache/huggingface/hub/"<downloaded model name>"
docker rmi nvcr.io/nvidia/vllm:26.01-py3
rm -rf $HOME/.cache/huggingface/hub/models--nvidia--Qwen3-235B-A22B-NVFP4
```