diff --git a/nvidia/station-nanochat/README.md b/nvidia/station-nanochat/README.md
index 1a5c069..29a4fd2 100644
--- a/nvidia/station-nanochat/README.md
+++ b/nvidia/station-nanochat/README.md
@@ -15,18 +15,16 @@

 ## Basic idea

-This playbook demonstrates training of [nanochat](https://github.com/karpathy/nanochat) on DGX Station with the GB300 Ultra Superchip. You run the full pipeline on a single system: custom BPE tokenizer training, base model pretraining, midtraining (conversation format), supervised fine-tuning (SFT), and inference via CLI or web UI.
+This playbook demonstrates training of [nanochat](https://github.com/karpathy/nanochat) on DGX Station with the GB300 Ultra Superchip. You run the full pipeline on a single system: custom BPE tokenizer training, base model pretraining, supervised fine-tuning (SFT), and inference via CLI or web UI.

-The project uses the PyTorch NGC container, FineWeb for pretraining, SmolTalk for SFT, and Weights & Biases for logging. The default speedrun configuration trains a 561M-parameter (d20) model suitable for learning and experimentation.
+The project uses the PyTorch NGC container, FineWeb for pretraining data, SmolTalk for SFT, and Weights & Biases for logging. The default configuration trains a ~1B parameter (d24) Transformer model with FP8 precision.

 ## What you'll accomplish

-You will have a working nanochat setup that trains a small LLM and serves it for chat.
-
-- **Environment:** Docker image with PyTorch and nanochat dependencies on your DGX Station.
-- **Training pipeline:** Tokenizer (65K BPE), pretraining (~11.2B tokens), midtraining, SFT, and automated report generation.
-- **Inference:** ChatGPT-style web UI and CLI to chat with the base, mid, or SFT checkpoints.
-- **Monitoring:** W&B dashboards and `nanochat/report.md` with metrics and samples.
+- **Environment:** Docker image with PyTorch NGC and nanochat dependencies on your DGX Station.
+- **Training pipeline:** BPE tokenizer (65K vocab), base model pretraining with FP8, SFT, and automated report generation.
+- **Inference:** ChatGPT-style web UI and CLI to chat with the base or SFT checkpoints.
+- **Monitoring:** W&B dashboards and `nanochat_cache/report/report.md` with metrics and samples.

 ## What to know before starting

@@ -38,104 +36,143 @@ You will have a working nanochat setup that trains a small LLM and serves it for

 **Hardware:**

-- NVIDIA DGX Station with GB300 Ultra Superchip.
-- Sufficient GPU memory for the chosen model (the GB300 Ultra provides ample memory for the d20 speedrun).
-- Adequate storage for cache (~24GB+ for FineWeb data and checkpoints).
+- NVIDIA DGX Station with GB300 Ultra Superchip (288GB VRAM).
+- Adequate storage for cache (~25GB+ for FineWeb data and checkpoints).

 **Software:**

-- Docker with NVIDIA Container Toolkit: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.01-py3 nvidia-smi`
-- Network access to download datasets (Hugging Face, FineWeb) and container images.
+- Docker with NVIDIA Container Toolkit: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi`
+- Network access to download datasets (Hugging Face, FineWeb) and container images (nvcr.io)
 - [Weights & Biases](https://wandb.ai/) account and API key.
 - [Hugging Face](https://huggingface.co/docs/hub/en/security-tokens) token for evaluation datasets.

+## Model architecture (d24)
+
+```
+Layers: 24
+Attention Heads: 12
+Head Dimension: 128
+Context Length: 2048 tokens
+Vocabulary Size: 65,536 (2^16, trained BPE)
+Precision: FP8 (e4m3, tensorwise scaling)
+```
+
+## Training stages
+
+| Stage | Description |
+|-------|-------------|
+| Tokenizer | Trains BPE tokenizer (65K vocab) on ~2B characters from FineWeb |
+| Base pretraining | Pretrains d24 model on FineWeb with FP8, target data:param ratio of 8 |
+| SFT | Fine-tunes on synthetic identity conversations + SmolTalk |
+| Report | Generates `report.md` with metrics, samples, and system info |
+
 ## Ancillary files

-All required assets are in the playbook directory `nvidia/station-nanochat/assets` (see the [dgx-spark-playbooks](https://github.com/NVIDIA/dgx-spark-playbooks) repository).
-
-- `assets/Dockerfile` – PyTorch NGC image plus nanochat dependencies and venv.
-- `assets/setup.sh` – Clones nanochat, checks out the supported commit, and builds the Docker image.
-- `assets/launch.sh` – Runs the training container on your DGX Station (runs the full pipeline: tokenizer, pretrain, midtrain, SFT, and report generation).
-- `assets/README.md` – Additional detail on training stages, inference, and troubleshooting.
+All required assets are in `nvidia/station-nanochat/assets/`:
+- `Dockerfile` – PyTorch NGC image with nanochat pip dependencies.
+- `setup.sh` – Clones nanochat, checks out the supported commit, copies `speedrun_station.sh`, and builds the Docker image.
+- `launch.sh` – Runs the training container (full pipeline: tokenizer → pretrain → SFT → report).
+- `speedrun_station.sh` – Modified speedrun script adapted for single-GPU DGX Station.

 ## Time & risk

-- **Estimated time:** About 30 minutes for clone, setup, and launching the run. Full d20 speedrun training time depends on your DGX Station configuration (hours to a day or more).
-- **Risk level:** Medium
-  - Large downloads (FineWeb) can fail or be slow; ensure stable network and disk space.
-  - API keys (W&B, HF) must be set or the launch script will exit.
-- **Rollback:** Stop containers with `docker stop`, remove caches under `~/.cache/nanochat` (or paths in `launch.sh`), and run `docker system prune -a` if needed.
-* **Last Updated:** 03/02/2026
-  * First Publication
+- **Estimated time:** ~30 minutes for setup. Full d24 training takes on the order of 16+ hours on a single GB300 Ultra.
+- **Risk level:** Medium
+  - Large downloads (FineWeb) can be slow; ensure stable network and disk space.
+  - API keys (W&B, HF) must be set or `launch.sh` will exit immediately.
+- **Rollback:** Stop containers with `docker stop`, remove caches, and run `docker system prune -a` if needed.
+
+## Credits
+
+- [nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy
+- [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) by HuggingFace (pretraining data)
+- [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) by HuggingFace (SFT data)

 ## Instructions

 ## Step 1. Prerequisites and environment

-This playbook is for **DGX Station** (single node). Ensure your DGX Station has Docker with NVIDIA runtime, GPU access, and required API keys. Nanochat uses Weights & Biases (W&B) for training visualization and a Hugging Face token for evaluation datasets.
+Ensure your DGX Station has Docker with NVIDIA runtime and GPU access. Nanochat uses Weights & Biases (W&B) for training visualization and a Hugging Face token for evaluation datasets.

 ```bash
 ## Verify GPU and Docker
 nvidia-smi
-docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.01-py3 nvidia-smi
+docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi
 ```

-Expected output should show your GPU(s) and driver version. Create a [W&B account](https://wandb.ai/) and a [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens) if you do not have them.
+Create a [W&B account](https://wandb.ai/) and a [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens) if you don't have them. Export both keys in your shell:

 ```bash
 export WANDB_API_KEY=
 export HF_TOKEN=
 ```

-## Step 2. Clone the playbook and set up nanochat
+## Step 2. Clone and set up

-Clone the playbook repository and run the setup script to clone the nanochat repo and build the Docker image.
+Clone the playbook repository and navigate to the assets directory:

 ```bash
 git clone https://github.com/NVIDIA/dgx-spark-playbooks
 cd dgx-spark-playbooks/nvidia/station-nanochat/assets
 ```

-From the `assets` directory, run the setup script. It clones [nanochat](https://github.com/karpathy/nanochat), checks out the supported commit, and builds the `nanochat` Docker image (PyTorch NGC base with tiktoken, tokenizers, datasets, wandb, etc.).
+Run the setup script. It clones [nanochat](https://github.com/karpathy/nanochat), checks out the supported commit, copies the station-adapted `speedrun_station.sh`, and builds the `nanochat` Docker image (PyTorch NGC base with dependencies):

 ```bash
 ./setup.sh
 ```

-Setup may take several minutes while the image builds. Verify the image:
+You should see the `nanochat` image listed if you run `docker images`. Your directory structure after setup should look like this:

-```bash
-docker images | grep nanochat
+```
+assets/
+├── Dockerfile
+├── launch.sh
+├── setup.sh
+├── speedrun_station.sh
+└── nanochat/
 ```

-You should see the `nanochat` image listed.
+## Step 3. Launch training

-## Step 3. Launch full training
-
-> [!NOTE]
-> The default `launch.sh` uses cache directories under `/nanochat_cache`. If that path does not exist on your DGX Station, edit `launch.sh` and replace those paths with your own (e.g. `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`), and create the directories before running.
-
-To run **full** training (d20 model, 240 shards, midtraining, SFT, report) for higher-quality results, use the full launcher. On a DGX Station with GB300 Ultra this can take on the order of 16 hours:
+Ensure your API keys are exported, then launch:

 ```bash
-export WANDB_API_KEY=
-export HF_TOKEN=
-./launch_full.sh
+./launch.sh
 ```

-This runs `speedrun_full.sh` inside the container: full FineWeb download (240 shards), 561M-parameter (d20) pretraining, midtraining, supervised fine-tuning, and report generation.
+The training runs inside the `nanochat` container and executes the full pipeline automatically:

-## Step 4. Verify and use the model
+1. **Tokenizer training** — downloads ~2B characters from FineWeb, trains a 65K BPE tokenizer
+2. **Base model pretraining** — downloads additional FineWeb shards, pretrains a d24 model (~1B params) with FP8
+3. **SFT** — downloads synthetic identity conversations, fine-tunes for chat
+4. **Report generation** — produces `report.md` with metrics and samples

-After training completes, checkpoints and the tokenizer are under `~/.cache/nanochat/` (or the cache path used in `launch.sh`). Run inference from the nanochat directory (e.g. `assets/nanochat`) on your DGX Station.
+Training on a single GB300 Ultra takes on the order of 16+ hours for the full d24 run.
+
+## Step 4. Monitor training
+
+**W&B dashboard:**
+
+Track training at [wandb.ai](https://wandb.ai/) under the `nanochat` project. The exact link to the wandb run would be provided in the training logs. Key metrics:
+- Training loss
+- Validation BPB
+- Throughput (tokens/sec)
+
+## Step 5. Inference
+
+After training, checkpoints are saved under the `nanochat_cache/` directory. Run inference from inside the container or interactively:

 **Web UI (recommended):**

 ```bash
-cd nanochat
-source ../.venv/bin/activate # if using venv from container context; otherwise use the container
-python -m scripts.chat_web
+docker run --rm --gpus all --net=host \
+  -v $(pwd)/nanochat:/workspace/nanochat \
+  -v $(pwd)/nanochat_cache:/root/.cache/nanochat \
+  -w /workspace/nanochat \
+  nanochat \
+  python -m scripts.chat_web
 ```

 Open a browser to `http://:8000` where `` is your DGX Station’s IP address.
@@ -143,14 +180,15 @@ Open a browser to `http://:8000` where `` is your DGX St

 **CLI:**

 ```bash
-cd nanochat
-python -m scripts.chat_cli -p "Why is the sky blue?"
-python -m scripts.chat_cli -i sft -p "Write a haiku about machine learning"
+docker run --rm -it --gpus all \
+  -v $(pwd)/nanochat:/workspace/nanochat \
+  -v $(pwd)/nanochat_cache:/root/.cache/nanochat \
+  -w /workspace/nanochat \
+  nanochat \
+  python -m scripts.chat_cli -p "Why is the sky blue?"
 ```

-A full report is generated at `nanochat/report.md` after the run. You can also monitor training at [wandb.ai](https://wandb.ai/) under your project.
-
-## Step 5. Cleanup
+## Step 6. Cleanup

 To stop training early, interrupt the launch script or stop the container:

@@ -160,32 +198,43 @@ To stop training early, interrupt the launch script or stop the container:

 ```bash
 ## If launch.sh is running: press Ctrl+C
-## Or stop the container by name
+## Or stop the container directly
 docker stop $(docker ps -q --filter ancestor=nanochat)
 ```

-To free disk space after training (use the same path as your cache if you set `NANOCHAT_CACHE`):
+To free disk space:

 ```bash
 rm -rf ./nanochat_cache ./hf_cache
 docker system prune -a
 ```

-## Step 6. Next steps and customization
+## Step 7. Customization

-- **Small scale run:** `./launch.sh` can run a lite training by following the customization guide to make changes to `speedrun_station.sh`. This can potentially bring down the training time.
-- **Custom cache paths:** Set `NANOCHAT_CACHE` and `HF_CACHE` before launching (e.g. `export NANOCHAT_CACHE=/path/to/nanochat_cache`) if you want cache outside the assets directory.
-- **Monitoring:** Use `nvidia-smi` and W&B dashboards to watch GPU utilization and training metrics (loss, throughput).
-- **Inference:** Try the web UI and CLI with different checkpoints (`base`, `mid`, `sft`) and prompts; see sample prompts in `assets/README.md`.
+**Smaller/faster run:** Edit `speedrun_station.sh` before running setup to reduce data and model size:
+
+```bash
+## Fewer data shards (10 instead of default)
+python -m nanochat.dataset -n 10 &
+
+## Smaller model (d4 instead of d24), smaller batch size
+python -m scripts.base_train --depth=4 --device-batch-size=32
+```
+
+**Batch size:** The default `--device-batch-size=64` is tuned for the GB300's 288GB VRAM. Feel free to change the batch size if utilization is low or the training OOMs.
+
+Then re-run `./setup.sh` to rebuild with the changes.

 ## Troubleshooting

 | Symptom | Cause | Fix |
-|--------|--------|-----|
-| `WANDB_API_KEY is not set` or `HF_TOKEN is not set` | Required env vars not exported before `launch.sh` | `export WANDB_API_KEY=` and `export HF_TOKEN=` in the same shell, then run `./launch.sh`. |
-| `RuntimeError: CUDA out of memory` | Batch size or model too large for GPU | In the training script in the cloned nanochat repo (e.g. `speedrun.sh`), reduce `--device_batch_size` (e.g. `16`, `8`, `4`, `2`, or `1`). |
-| Docker container not starting or no GPU | Docker or NVIDIA runtime misconfigured | Run `nvidia-smi` on your DGX Station. Check no other containers hold GPUs: `docker ps`. Test GPU in Docker: `docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi`. |
-| `Permission denied` or `No such file or directory` for cache paths in `launch.sh` | Paths like `/home/scratch.lramesh_dpt/...` don’t exist on your system | Edit `launch.sh`: set cache dirs to paths you can create (e.g. `$(pwd)/nanochat_cache`, `$(pwd)/hf_cache`). Run `mkdir -p ` and re-run `launch.sh`. |
-| `nanochat` image not found when running `launch.sh` | Setup not run or build failed | From `nvidia/nanochat/assets`, run `./setup.sh` and confirm with `docker images` (look for the `nanochat` image). |
-| Training exits immediately or script doesn’t wait | Container fails early (missing keys, paths, or OOM) | Check container logs: `docker ps -a` then `docker logs `. Fix env vars, cache paths, or batch size as above. |
-| Wrong cache path or "No such file" when launching | `launch.sh` uses non-existent paths (e.g. `/home/scratch...`) | On DGX Station, edit `launch.sh`: replace cache dirs with `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`, then run `mkdir -p nanochat_cache hf_cache`. |
+|---------|-------|-----|
+| `WANDB_API_KEY is not set` or `HF_TOKEN is not set` | Required env vars not exported before `launch.sh` | `export WANDB_API_KEY=` and `export HF_TOKEN=` in the same shell, then re-run `./launch.sh` |
+| `RuntimeError: CUDA out of memory` | Batch size too large for available VRAM | Edit `speedrun_station.sh`: reduce `--device-batch-size` (try 64, 32, 16, 8). Re-run `./setup.sh` then `./launch.sh` |
+| Docker container exits immediately | Missing env vars, bad cache paths, or build failure | Check logs: `docker ps -a` then `docker logs `. Fix env vars or paths as needed |
+| `nanochat` image not found | Setup not run or Docker build failed | From the `assets/` directory, run `./setup.sh` and confirm with `docker images \| grep nanochat` |
+| `No such file or directory` for cache paths | Cache directories don't exist | `launch.sh` creates them automatically under `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`. If using custom paths, create them: `mkdir -p $NANOCHAT_CACHE $HF_CACHE` |
+| Training hangs at "Waiting for dataset download" | Network issue downloading FineWeb shards | Check network connectivity. The download can take time depending on bandwidth. If it persists, restart `./launch.sh` |
+| W&B shows wrong user / stale login | Cached W&B credentials in container volume | `speedrun_station.sh` runs `wandb login --relogin` with your key automatically. Ensure `WANDB_API_KEY` is correct |
+| Container runs but `launch.sh` says "Training complete!" immediately | Container failed fast and exited before the poll loop detected it | Check `docker ps -a` for the exited container and inspect logs with `docker logs ` |
+| GPU not visible inside container | Docker NVIDIA runtime not configured | Test: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi`. If it fails, install/configure [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) |
diff --git a/nvidia/station-nanochat/assets/Dockerfile b/nvidia/station-nanochat/assets/Dockerfile
index 9c59128..98396bb 100755
--- a/nvidia/station-nanochat/assets/Dockerfile
+++ b/nvidia/station-nanochat/assets/Dockerfile
@@ -1,11 +1,15 @@
-FROM nvcr.io/nvidia/pytorch:25.09-py3
+FROM nvcr.io/nvidia/pytorch:26.04-py3

 WORKDIR /workspace

-# Install dependencies globally so torchrun (which uses /usr/bin/python) can access them
-RUN /usr/bin/python -m pip install tiktoken tokenizers datasets psutil files-to-prompt regex setuptools uvicorn wandb maturin
-
-# Create venv with --system-site-packages so it inherits global packages
-RUN /usr/bin/python -m venv --system-site-packages .venv
+RUN pip install \
+    datasets \
+    tokenizers \
+    wandb \
+    tiktoken \
+    psutil \
+    files-to-prompt \
+    uvicorn \
+    rustbpe

 CMD ["/bin/bash"]
\ No newline at end of file
diff --git a/nvidia/station-nanochat/assets/launch.sh b/nvidia/station-nanochat/assets/launch.sh
index 5b9a5d8..a05c708 100755
--- a/nvidia/station-nanochat/assets/launch.sh
+++ b/nvidia/station-nanochat/assets/launch.sh
@@ -3,7 +3,6 @@
 # SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
 #
-# Lite training (default). Runs speedrun.sh, which setup copies from speedrun_lite.sh.

 # Get wandb API key
 export WANDB_API_KEY=$WANDB_API_KEY
@@ -11,7 +10,6 @@ if [ -z "$WANDB_API_KEY" ]; then
     echo "WANDB_API_KEY is not set"
     exit 1
 fi
-
 export WANDB_RUN=${WANDB_RUN:-speedrun}

 # Get Hugging Face API key
@@ -21,26 +19,23 @@ if [ -z "$HF_TOKEN" ]; then
     exit 1
 fi

-# Cleanup function to stop containers
+# Use local cache dirs so no root paths are required
+workdir=$(pwd)
+NANOCHAT_CACHE="$(pwd)/nanochat_cache"
+HF_CACHE="$(pwd)/hf_cache"
+
 cleanup() {
-    echo
-    echo "Stopping containers..."
-    docker stop $(docker ps -q --filter ancestor=nanochat) 2>/dev/null || true
-    echo "Interrupted training!"
+    echo -e "\nStopping training container..."
+    docker stop $(docker ps -q --filter ancestor=nanochat) 2>/dev/null
+    echo "Cleanup complete."
     exit 0
 }

-workdir=$(pwd)
-# DGX Station: use local cache dirs so no root paths are required
-NANOCHAT_CACHE="${NANOCHAT_CACHE:-$(pwd)/nanochat_cache}"
-HF_CACHE="${HF_CACHE:-$(pwd)/hf_cache}"
-mkdir -p "$NANOCHAT_CACHE" "$HF_CACHE"
+trap cleanup SIGINT SIGTERM

-cmd="
-mkdir -p /nanochat_cache && \
-mkdir -p /hf_cache && \
-chmod 777 /nanochat_cache && \
-chmod 777 /hf_cache && \
+# Launch Nanochat training
+cmd="mkdir -p $NANOCHAT_CACHE $HF_CACHE && \
+chmod u+rwx $NANOCHAT_CACHE $HF_CACHE && \
 docker run \
     --rm \
     --runtime=nvidia \
@@ -57,16 +52,8 @@ docker run \
     -v $HF_CACHE:/root/.cache/huggingface \
     -w /workspace/nanochat \
     nanochat \
-    bash speedrun.sh"
-
+    bash runs/speedrun.sh"
 sh -c "$cmd" &

-sleep 5
-while true; do
-    if ! docker ps | grep -q "nanochat"; then
-        echo
-        echo "Training complete!"
-        exit 0
-    fi
-    sleep 1
-done
\ No newline at end of file
+wait
+echo -e "\nTraining complete!"
diff --git a/nvidia/station-nanochat/assets/setup.sh b/nvidia/station-nanochat/assets/setup.sh
index 57fe205..a35d12e 100755
--- a/nvidia/station-nanochat/assets/setup.sh
+++ b/nvidia/station-nanochat/assets/setup.sh
@@ -11,10 +11,10 @@ assets_dir="$(cd "$(dirname "$0")" && pwd)"
 cmd="cd $workdir && \
 git clone https://github.com/karpathy/nanochat.git && \
 cd nanochat && \
-git checkout c6b7ab744055d5915e6ccb61088de80c10cbaff9 && \
-cp ../speedrun_spark.sh ./speedrun.sh && \
+git checkout 0aaca56805eb13f6e6e1fff789a08086902f12ab && \
+cp ../speedrun_station.sh ./runs/speedrun.sh && \
 cd .. && \
-chmod +x launch_full.sh 2>/dev/null || true && \
+chmod +x launch.sh 2>/dev/null || true && \
 docker build -t nanochat ."

 sh -c "$cmd"
diff --git a/nvidia/station-nanochat/assets/speedrun_station.sh b/nvidia/station-nanochat/assets/speedrun_station.sh
index 0788021..100a699 100644
--- a/nvidia/station-nanochat/assets/speedrun_station.sh
+++ b/nvidia/station-nanochat/assets/speedrun_station.sh
@@ -1,15 +1,14 @@
 #!/bin/bash
-set -e

-# This script is the "Best ChatGPT clone that $100 can buy",
-# It is designed to run in ~4 hours on 8XH100 node at $3/GPU/hour.
+# This script is configured to train your own GPT-2 grade LLM (pretraining + finetuning)
+# It is designed to run on a blank 8XH100 GPU node and takes approximately 3 hours to complete.

 # 1) Example launch (simplest):
-# bash speedrun.sh
-# 2) Example launch in a screen session (because the run takes ~4 hours):
-# screen -L -Logfile speedrun.log -S speedrun bash speedrun.sh
+# bash runs/speedrun.sh
+# 2) Example launch in a screen session (because the run takes ~3 hours):
+# screen -L -Logfile runs/speedrun.log -S speedrun bash runs/speedrun.sh
 # 3) Example launch with wandb logging, but see below for setting up wandb first:
-# WANDB_RUN=speedrun screen -L -Logfile speedrun.log -S speedrun bash speedrun.sh
+# WANDB_RUN=speedrun screen -L -Logfile runs/speedrun.log -S speedrun bash runs/speedrun.sh

 # Default intermediate artifacts directory is in ~/.cache/nanochat
 export OMP_NUM_THREADS=1
@@ -26,7 +25,7 @@ mkdir -p $NANOCHAT_BASE_DIR
 # install the repo dependencies
 # uv sync --extra gpu
 # activate venv so that `python` uses the project's venv instead of system python
-source ../.venv/bin/activate
+# source .venv/bin/activate

 # -----------------------------------------------------------------------------
 # wandb setup
@@ -49,70 +48,41 @@ python -m nanochat.report reset
 # -----------------------------------------------------------------------------
 # Tokenizer

-# Install Rust / Cargo
-curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
-source "$HOME/.cargo/env"
-
-# Build the rustbpe Tokenizer
-# unset VIRTUAL_ENV
-maturin develop --release --manifest-path rustbpe/Cargo.toml
-
 # Download the first ~2B characters of pretraining dataset
-# look at dev/repackage_data_reference.py for details on how this data was prepared
 # each data shard is ~250M chars
 # so we download 2e9 / 250e6 = 8 data shards at this point
 # each shard is ~100MB of text (compressed), so this is about ~800MB of data on disk
+# look at dev/repackage_data_reference.py for details on how this data was prepared
 python -m nanochat.dataset -n 8
 # Immediately also kick off downloading more shards in the background while tokenizer trains
-# See comment below for why 240 is the right number here
-python -m nanochat.dataset -n 240 &
+# Approximately 150 shards are needed for GPT-2 capability pretraining, add 20 for padding.
+# The maximum total number of shards available in the entire dataset is 6542.
+python -m nanochat.dataset -n 170 &
 DATASET_DOWNLOAD_PID=$!
-# train the tokenizer with vocab size 2**16 = 65536 on ~2B characters of data
-python -m scripts.tok_train --max_chars=2000000000
+# train the tokenizer with vocab size 2**15 = 32768 on ~2B characters of data
+python -m scripts.tok_train
 # evaluate the tokenizer (report compression ratio etc.)
 python -m scripts.tok_eval

 # -----------------------------------------------------------------------------
 # Base model (pretraining)
-
-# The d20 model is 561M parameters.
-# Chinchilla says #tokens = 20X #params, so we need 561e6 * 20 = 11.2B tokens.
-# Assume our tokenizer is 4.8 chars/token, this is 11.2B * 4.8 ~= 54B chars.
-# At 250M chars/shard, this is 54B / 250M ~= 216 shards needed for pretraining.
-# Round up to 240 for safety. At ~100MB/shard, this downloads ~24GB of data to disk.
-# (The total number of shards available in the entire dataset is 1822.)
 echo "Waiting for dataset download to complete..."
 wait $DATASET_DOWNLOAD_PID
-source ../.venv/bin/activate
-
-# pretrain the d20 model
-python -m scripts.base_train --depth=20 --run=$WANDB_RUN
-# evaluate the model on a larger chunk of train/val data and draw some samples
-python -m scripts.base_loss
-# evaluate the model on CORE tasks
-python -m scripts.base_eval
-
-sleep 5
+# d24 model (slightly undertrained to beat GPT-2 => decrease data:params ratio from compute optimal 10.5 (default) to 8)
+python -m scripts.base_train --depth=24 --target-param-data-ratio=8 --device-batch-size=64 --fp8 --run=$WANDB_RUN
+# evaluate the model: CORE metric, BPB on train/val, and draw samples
+python -m scripts.base_eval --device-batch-size=64

 # -----------------------------------------------------------------------------
-# Midtraining (teach the model conversation special tokens, tool use, multiple choice)
+# SFT (teach the model conversation special tokens, tool use, multiple choice)

 # download 2.3MB of synthetic identity conversations to impart a personality to nanochat
-# see dev/gen_sft_data.py for details on how this data was prepared and to get a sense of how you can easily tune it
+# see dev/gen_synthetic_data.py for details on how this data was prepared and to get a sense of how you can easily tune it
 curl -L -o $NANOCHAT_BASE_DIR/identity_conversations.jsonl https://karpathy-public.s3.us-west-2.amazonaws.com/identity_conversations.jsonl

-# run midtraining and eval the model
-python -m scripts.mid_train --run=$WANDB_RUN
-python -m scripts.chat_eval -i mid
-
-sleep 5
-
-# -----------------------------------------------------------------------------
-# Supervised Finetuning (domain adaptation to each sequence all by itself per row)
-
-# train sft and re-eval right away (should see a small bump)
-python -m scripts.chat_sft --run=$WANDB_RUN
+# run SFT and eval the model
+python -m scripts.chat_sft --device-batch-size=64 --run=$WANDB_RUN
 python -m scripts.chat_eval -i sft

 # chat with the model over CLI! Leave out the -p to chat interactively
@@ -121,15 +91,6 @@ python -m scripts.chat_eval -i sft
 # even better, chat with your model over a pretty WebUI ChatGPT style
 # python -m scripts.chat_web

-# -----------------------------------------------------------------------------
-# Reinforcement Learning. Optional, and currently only on GSM8K
-# (optional)
-
-# run reinforcement learning
-# python -m scripts.chat_rl --run=$WANDB_RUN
-# eval the RL model only on GSM8K
-# python -m scripts.chat_eval --i rl -a GSM8K
-
 # -----------------------------------------------------------------------------
 # Generate the full report by putting together all the sections
 # report.md is the output and will be copied to current directory for convenience
diff --git a/nvidia/station-nanochat/endpoint-test.yaml b/nvidia/station-nanochat/endpoint-test.yaml
index 0da04f6..03a1063 100644
--- a/nvidia/station-nanochat/endpoint-test.yaml
+++ b/nvidia/station-nanochat/endpoint-test.yaml
@@ -45,18 +45,16 @@ spec:
 content: |
 # Basic idea

- This playbook demonstrates training of [nanochat](https://github.com/karpathy/nanochat) on DGX Station with the GB300 Ultra Superchip. You run the full pipeline on a single system: custom BPE tokenizer training, base model pretraining, midtraining (conversation format), supervised fine-tuning (SFT), and inference via CLI or web UI.
+ This playbook demonstrates training of [nanochat](https://github.com/karpathy/nanochat) on DGX Station with the GB300 Ultra Superchip. You run the full pipeline on a single system: custom BPE tokenizer training, base model pretraining, supervised fine-tuning (SFT), and inference via CLI or web UI.

- The project uses the PyTorch NGC container, FineWeb for pretraining, SmolTalk for SFT, and Weights & Biases for logging. The default speedrun configuration trains a 561M-parameter (d20) model suitable for learning and experimentation.
+ The project uses the PyTorch NGC container, FineWeb for pretraining data, SmolTalk for SFT, and Weights & Biases for logging. The default configuration trains a ~1B parameter (d24) Transformer model with FP8 precision.

 # What you'll accomplish

- You will have a working nanochat setup that trains a small LLM and serves it for chat.
-
- - **Environment:** Docker image with PyTorch and nanochat dependencies on your DGX Station.
- - **Training pipeline:** Tokenizer (65K BPE), pretraining (~11.2B tokens), midtraining, SFT, and automated report generation.
- - **Inference:** ChatGPT-style web UI and CLI to chat with the base, mid, or SFT checkpoints.
- - **Monitoring:** W&B dashboards and `nanochat/report.md` with metrics and samples.
+ - **Environment:** Docker image with PyTorch NGC and nanochat dependencies on your DGX Station.
+ - **Training pipeline:** BPE tokenizer (65K vocab), base model pretraining with FP8, SFT, and automated report generation.
+ - **Inference:** ChatGPT-style web UI and CLI to chat with the base or SFT checkpoints.
+ - **Monitoring:** W&B dashboards and `nanochat_cache/report/report.md` with metrics and samples.

 # What to know before starting

@@ -68,36 +66,58 @@ spec:

 **Hardware:**

- - NVIDIA DGX Station with GB300 Ultra Superchip.
- - Sufficient GPU memory for the chosen model (the GB300 Ultra provides ample memory for the d20 speedrun).
- - Adequate storage for cache (~24GB+ for FineWeb data and checkpoints).
+ - NVIDIA DGX Station with GB300 Ultra Superchip (288GB VRAM).
+ - Adequate storage for cache (~25GB+ for FineWeb data and checkpoints).

 **Software:**

- - Docker with NVIDIA Container Toolkit: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.01-py3 nvidia-smi`
- - Network access to download datasets (Hugging Face, FineWeb) and container images.
+ - Docker with NVIDIA Container Toolkit: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi`
+ - Network access to download datasets (Hugging Face, FineWeb) and container images (nvcr.io)
 - [Weights & Biases](https://wandb.ai/) account and API key.
 - [Hugging Face](https://huggingface.co/docs/hub/en/security-tokens) token for evaluation datasets.

+ # Model architecture (d24)
+
+ ```
+ Layers: 24
+ Attention Heads: 12
+ Head Dimension: 128
+ Context Length: 2048 tokens
+ Vocabulary Size: 65,536 (2^16, trained BPE)
+ Precision: FP8 (e4m3, tensorwise scaling)
+ ```
+
+ # Training stages
+
+ | Stage | Description |
+ |-------|-------------|
+ | Tokenizer | Trains BPE tokenizer (65K vocab) on ~2B characters from FineWeb |
+ | Base pretraining | Pretrains d24 model on FineWeb with FP8, target data:param ratio of 8 |
+ | SFT | Fine-tunes on synthetic identity conversations + SmolTalk |
+ | Report | Generates `report.md` with metrics, samples, and system info |
+
 # Ancillary files

- All required assets are in the playbook directory `nvidia/station-nanochat/assets` (see the [dgx-spark-playbooks](https://github.com/NVIDIA/dgx-spark-playbooks) repository).
-
- - `assets/Dockerfile` – PyTorch NGC image plus nanochat dependencies and venv.
- - `assets/setup.sh` – Clones nanochat, checks out the supported commit, and builds the Docker image.
- - `assets/launch.sh` – Runs the training container on your DGX Station (runs the full pipeline: tokenizer, pretrain, midtrain, SFT, and report generation).
- - `assets/README.md` – Additional detail on training stages, inference, and troubleshooting.
+ All required assets are in `nvidia/station-nanochat/assets/`:
+ - `Dockerfile` – PyTorch NGC image with nanochat pip dependencies.
+ - `setup.sh` – Clones nanochat, checks out the supported commit, copies `speedrun_station.sh`, and builds the Docker image.
+ - `launch.sh` – Runs the training container (full pipeline: tokenizer → pretrain → SFT → report).
+ - `speedrun_station.sh` – Modified speedrun script adapted for single-GPU DGX Station. # Time & risk - - **Estimated time:** About 30 minutes for clone, setup, and launching the run. Full d20 speedrun training time depends on your DGX Station configuration (hours to a day or more). - - **Risk level:** Medium - - Large downloads (FineWeb) can fail or be slow; ensure stable network and disk space. - - API keys (W&B, HF) must be set or the launch script will exit. - - **Rollback:** Stop containers with `docker stop`, remove caches under `~/.cache/nanochat` (or paths in `launch.sh`), and run `docker system prune -a` if needed. - * **Last Updated:** 03/02/2026 - * First Publication + - **Estimated time:** ~30 minutes for setup. Full d24 training takes on the order of 16+ hours on a single GB300 Ultra. + - **Risk level:** Medium + - Large downloads (FineWeb) can be slow; ensure stable network and disk space. + - API keys (W&B, HF) must be set or `launch.sh` will exit immediately. + - **Rollback:** Stop containers with `docker stop`, remove caches, and run `docker system prune -a` if needed. + + # Credits + + - [nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy + - [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) by HuggingFace (pretraining data) + - [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) by HuggingFace (SFT data) @@ -108,69 +128,86 @@ spec: content: | # Step 1. Prerequisites and environment - This playbook is for **DGX Station** (single node). Ensure your DGX Station has Docker with NVIDIA runtime, GPU access, and required API keys. Nanochat uses Weights & Biases (W&B) for training visualization and a Hugging Face token for evaluation datasets. + Ensure your DGX Station has Docker with NVIDIA runtime and GPU access. Nanochat uses Weights & Biases (W&B) for training visualization and a Hugging Face token for evaluation datasets. 
```bash # Verify GPU and Docker nvidia-smi - docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.01-py3 nvidia-smi + docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi ``` - Expected output should show your GPU(s) and driver version. Create a [W&B account](https://wandb.ai/) and a [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens) if you do not have them. + Create a [W&B account](https://wandb.ai/) and a [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens) if you don't have them. Export both keys in your shell: ```bash export WANDB_API_KEY= export HF_TOKEN= ``` - # Step 2. Clone the playbook and set up nanochat + # Step 2. Clone and set up - Clone the playbook repository and run the setup script to clone the nanochat repo and build the Docker image. + Clone the playbook repository and navigate to the assets directory: ```bash git clone https://github.com/NVIDIA/dgx-spark-playbooks cd dgx-spark-playbooks/nvidia/station-nanochat/assets ``` - From the `assets` directory, run the setup script. It clones [nanochat](https://github.com/karpathy/nanochat), checks out the supported commit, and builds the `nanochat` Docker image (PyTorch NGC base with tiktoken, tokenizers, datasets, wandb, etc.). + Run the setup script. It clones [nanochat](https://github.com/karpathy/nanochat), checks out the supported commit, copies the station-adapted `speedrun_station.sh`, and builds the `nanochat` Docker image (PyTorch NGC base with dependencies): ```bash ./setup.sh ``` - Setup may take several minutes while the image builds. Verify the image: + You should see the `nanochat` image listed if you run `docker images`. Your directory structure after setup should look like this: - ```bash - docker images | grep nanochat + ``` + assets/ + ├── Dockerfile + ├── launch.sh + ├── setup.sh + ├── speedrun_station.sh + └── nanochat/ ``` - You should see the `nanochat` image listed. + # Step 3. Launch training - # Step 3. 
Launch full training - - > [!NOTE] - > The default `launch.sh` uses cache directories under `/nanochat_cache`. If that path does not exist on your DGX Station, edit `launch.sh` and replace those paths with your own (e.g. `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`), and create the directories before running. - - To run **full** training (d20 model, 240 shards, midtraining, SFT, report) for higher-quality results, use the full launcher. On a DGX Station with GB300 Ultra this can take on the order of 16 hours: + Ensure your API keys are exported, then launch: ```bash - export WANDB_API_KEY= - export HF_TOKEN= - ./launch_full.sh + ./launch.sh ``` - This runs `speedrun_full.sh` inside the container: full FineWeb download (240 shards), 561M-parameter (d20) pretraining, midtraining, supervised fine-tuning, and report generation. + The training runs inside the `nanochat` container and executes the full pipeline automatically: - # Step 4. Verify and use the model + 1. **Tokenizer training** — downloads ~2B characters from FineWeb, trains a 65K BPE tokenizer + 2. **Base model pretraining** — downloads additional FineWeb shards, pretrains a d24 model (~1B params) with FP8 + 3. **SFT** — downloads synthetic identity conversations, fine-tunes for chat + 4. **Report generation** — produces `report.md` with metrics and samples - After training completes, checkpoints and the tokenizer are under `~/.cache/nanochat/` (or the cache path used in `launch.sh`). Run inference from the nanochat directory (e.g. `assets/nanochat`) on your DGX Station. + Training on a single GB300 Ultra takes on the order of 16+ hours for the full d24 run. + + # Step 4. Monitor training + + **W&B dashboard:** + + Track training at [wandb.ai](https://wandb.ai/) under the `nanochat` project. The exact link to the wandb run would be provided in the training logs. Key metrics: + - Training loss + - Validation BPB + - Throughput (tokens/sec) + + # Step 5. 
Inference + + After training, checkpoints are saved under the `nanochat_cache/` directory. Run inference from inside the container or interactively: **Web UI (recommended):** ```bash - cd nanochat - source ../.venv/bin/activate # if using venv from container context; otherwise use the container - python -m scripts.chat_web + docker run --rm --gpus all --net=host \ + -v $(pwd)/nanochat:/workspace/nanochat \ + -v $(pwd)/nanochat_cache:/root/.cache/nanochat \ + -w /workspace/nanochat \ + nanochat \ + python -m scripts.chat_web ``` Open a browser to `http://:8000` where `` is your DGX Station’s IP address. @@ -178,14 +215,15 @@ spec: **CLI:** ```bash - cd nanochat - python -m scripts.chat_cli -p "Why is the sky blue?" - python -m scripts.chat_cli -i sft -p "Write a haiku about machine learning" + docker run --rm -it --gpus all \ + -v $(pwd)/nanochat:/workspace/nanochat \ + -v $(pwd)/nanochat_cache:/root/.cache/nanochat \ + -w /workspace/nanochat \ + nanochat \ + python -m scripts.chat_cli -p "Why is the sky blue?" ``` - A full report is generated at `nanochat/report.md` after the run. You can also monitor training at [wandb.ai](https://wandb.ai/) under your project. - - # Step 5. Cleanup + # Step 6. Cleanup To stop training early, interrupt the launch script or stop the container: @@ -195,23 +233,32 @@ spec: ```bash # If launch.sh is running: press Ctrl+C - # Or stop the container by name + # Or stop the container directly docker stop $(docker ps -q --filter ancestor=nanochat) ``` - To free disk space after training (use the same path as your cache if you set `NANOCHAT_CACHE`): + To free disk space: ```bash rm -rf ./nanochat_cache ./hf_cache docker system prune -a ``` - # Step 6. Next steps and customization + # Step 7. Customization - - **Small scale run:** `./launch.sh` can run a lite training by following the customization guide to make changes to `speedrun_station.sh`. This can potentially bring down the training time. 
- - **Custom cache paths:** Set `NANOCHAT_CACHE` and `HF_CACHE` before launching (e.g. `export NANOCHAT_CACHE=/path/to/nanochat_cache`) if you want cache outside the assets directory. - - **Monitoring:** Use `nvidia-smi` and W&B dashboards to watch GPU utilization and training metrics (loss, throughput). - - **Inference:** Try the web UI and CLI with different checkpoints (`base`, `mid`, `sft`) and prompts; see sample prompts in `assets/README.md`. + **Smaller/faster run:** Edit `speedrun_station.sh` before running setup to reduce data and model size: + + ```bash + # Fewer data shards (10 instead of default) + python -m nanochat.dataset -n 10 & + + # Smaller model (d4 instead of d24), smaller batch size + python -m scripts.base_train --depth=4 --device-batch-size=32 + ``` + + **Batch size:** The default `--device-batch-size=64` is tuned for the GB300's 288GB VRAM. Feel free to change the batch size if utilization is low or the training OOMs. + + Then re-run `./setup.sh` to rebuild with the changes. @@ -221,14 +268,16 @@ spec: label: Troubleshooting content: | | Symptom | Cause | Fix | - |--------|--------|-----| - | `WANDB_API_KEY is not set` or `HF_TOKEN is not set` | Required env vars not exported before `launch.sh` | `export WANDB_API_KEY=` and `export HF_TOKEN=` in the same shell, then run `./launch.sh`. | - | `RuntimeError: CUDA out of memory` | Batch size or model too large for GPU | In the training script in the cloned nanochat repo (e.g. `speedrun.sh`), reduce `--device_batch_size` (e.g. `16`, `8`, `4`, `2`, or `1`). | - | Docker container not starting or no GPU | Docker or NVIDIA runtime misconfigured | Run `nvidia-smi` on your DGX Station. Check no other containers hold GPUs: `docker ps`. Test GPU in Docker: `docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi`. 
| - | `Permission denied` or `No such file or directory` for cache paths in `launch.sh` | Paths like `/home/scratch.lramesh_dpt/...` don’t exist on your system | Edit `launch.sh`: set cache dirs to paths you can create (e.g. `$(pwd)/nanochat_cache`, `$(pwd)/hf_cache`). Run `mkdir -p ` and re-run `launch.sh`. | - | `nanochat` image not found when running `launch.sh` | Setup not run or build failed | From `nvidia/nanochat/assets`, run `./setup.sh` and confirm with `docker images` (look for the `nanochat` image). | - | Training exits immediately or script doesn’t wait | Container fails early (missing keys, paths, or OOM) | Check container logs: `docker ps -a` then `docker logs `. Fix env vars, cache paths, or batch size as above. | - | Wrong cache path or "No such file" when launching | `launch.sh` uses non-existent paths (e.g. `/home/scratch...`) | On DGX Station, edit `launch.sh`: replace cache dirs with `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`, then run `mkdir -p nanochat_cache hf_cache`. | + |---------|-------|-----| + | `WANDB_API_KEY is not set` or `HF_TOKEN is not set` | Required env vars not exported before `launch.sh` | `export WANDB_API_KEY=` and `export HF_TOKEN=` in the same shell, then re-run `./launch.sh` | + | `RuntimeError: CUDA out of memory` | Batch size too large for available VRAM | Edit `speedrun_station.sh`: reduce `--device-batch-size` (try 64, 32, 16, 8). Re-run `./setup.sh` then `./launch.sh` | + | Docker container exits immediately | Missing env vars, bad cache paths, or build failure | Check logs: `docker ps -a` then `docker logs `. Fix env vars or paths as needed | + | `nanochat` image not found | Setup not run or Docker build failed | From the `assets/` directory, run `./setup.sh` and confirm with `docker images \| grep nanochat` | + | `No such file or directory` for cache paths | Cache directories don't exist | `launch.sh` creates them automatically under `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`. 
If using custom paths, create them: `mkdir -p $NANOCHAT_CACHE $HF_CACHE` | + | Training hangs at "Waiting for dataset download" | Network issue downloading FineWeb shards | Check network connectivity. The download can take time depending on bandwidth. If it persists, restart `./launch.sh` | + | W&B shows wrong user / stale login | Cached W&B credentials in container volume | `speedrun_station.sh` runs `wandb login --relogin` with your key automatically. Ensure `WANDB_API_KEY` is correct | + | Container runs but `launch.sh` says "Training complete!" immediately | Container failed fast and exited before the poll loop detected it | Check `docker ps -a` for the exited container and inspect logs with `docker logs ` | + | GPU not visible inside container | Docker NVIDIA runtime not configured | Test: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi`. If it fails, install/configure [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) | diff --git a/nvidia/station-nvfp4-pretraining/endpoint-production.yaml b/nvidia/station-nvfp4-pretraining/endpoint-production.yaml index 8f4fbc6..97da6e8 100644 --- a/nvidia/station-nvfp4-pretraining/endpoint-production.yaml +++ b/nvidia/station-nvfp4-pretraining/endpoint-production.yaml @@ -87,7 +87,7 @@ spec: - NVIDIA DGX Station with Blackwell architecture GPU (GB300 chip) - Docker installed with GPU support - NVIDIA Container Toolkit configured - - Megatron-Bridge installed (via the the NeMo Framework NGC container) + - Megatron-Bridge installed (via the NeMo Framework NGC container) Verify your setup: @@ -139,7 +139,7 @@ spec: nvcr.io/nvidia/nemo:${TAG} ``` - All subsequent `torchrun` / `python` commands in this playbook are meant to be executed **from the shell inside this container** . + All subsequent `torchrun` / `python` commands in this playbook are meant to be executed **from the shell inside this container**. # Step 2. 
Review the pretraining script @@ -279,7 +279,7 @@ spec: | `CUDA out of memory` during model init | Insufficient GPU memory for Llama 3.1 8B + optimizer states | Reduce `micro_batch_size` or use `--nproc_per_node` for model parallelism | | `torchrun` hangs or times out | NCCL communication failure between GPUs | Check `NCCL_DEBUG=INFO torchrun ...` for details; verify all GPUs are visible | | Training loss is NaN | Precision instability | Increase `num_layers_at_end_in_bf16` (e.g., from 4 to 8) or reduce learning rate | - | `--no-fp4` works but NVFP4 crashes | Transformer Engine version mismatch | Ensure Transformer Engine supports NVFP4; update with `pip install --upgrade transformer-engine` | + | `--disable-fp4` works but NVFP4 crashes | Transformer Engine version mismatch | Ensure Transformer Engine supports NVFP4; update with `pip install --upgrade transformer-engine` | | Slow training throughput | Not using Tensor Cores efficiently | Ensure batch dimensions are multiples of 8; check that `nvidia-smi` shows high GPU utilization | | Permission denied on Docker | User not in docker group | Run `sudo usermod -aG docker $USER && newgrp docker` | diff --git a/nvidia/station-sglang-inference/endpoint-production.yaml b/nvidia/station-sglang-inference/endpoint-production.yaml index 4d5ae71..06507ab 100644 --- a/nvidia/station-sglang-inference/endpoint-production.yaml +++ b/nvidia/station-sglang-inference/endpoint-production.yaml @@ -330,7 +330,7 @@ spec: ```bash git clone https://github.com/NVIDIA/dgx-spark-playbooks - cd dgx-station-playbooks/nvidia/station-sglang-inference + cd dgx-spark-playbooks/nvidia/station-sglang-inference ``` > [!TIP] diff --git a/nvidia/station-vllm/endpoint-production.yaml b/nvidia/station-vllm/endpoint-production.yaml index c003724..e22f1c8 100644 --- a/nvidia/station-vllm/endpoint-production.yaml +++ b/nvidia/station-vllm/endpoint-production.yaml @@ -1,8 +1,8 @@ kind: Playbook metadata: name: station-vllm - displayName: Serve Qwen3-235B 
with vLLM - shortDescription: Set up vLLM server with Qwen3-235B on DGX Station + displayName: vLLM for Inference + shortDescription: Install and use vLLM on DGX Station publisher: nvidia description: | # REPLACE THIS WITH YOUR MODEL CARD @@ -15,7 +15,7 @@ metadata: attributes: - key: DURATION - value: 20 MIN + value: 30 MIN spec: artifactName: station-vllm @@ -42,7 +42,9 @@ spec: # What you'll accomplish - Serve the **Qwen3-235B-A22B-NVFP4** model using vLLM on NVIDIA DGX Station. This 235B parameter model uses NVFP4 quantization and fits entirely in VRAM on the GB300 GPU. + Serve a **supported model** using vLLM on NVIDIA DGX Station. Refer to the table below to see the supported models. + + You'll set up vLLM high-throughput LLM serving on NVIDIA DGX Station with Blackwell architecture. # What to know before starting @@ -57,21 +59,30 @@ spec: - HuggingFace account with access token - Network access to NGC and HuggingFace + # Model Support Matrix + + The following models are supported with vLLM on DGX Station. 
All listed models are available and ready to use: + + | Model | Quantization | Support Status | HF Handle | + |-------|-------------|----------------|-----------| + | **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) | + | **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) | + | **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) | # Time & risk - * **Duration:** 15-20 minutes (longer on first run due to model download) + * **Duration:** 30 minutes (longer on first run due to model download) * **Risks:** Model download requires HuggingFace authentication * **Rollback:** Stop and remove the container to restore state - * **Last Updated:** 03/02/2026 - * First Publication + * **Last Updated:** 05/28/2026 + * Update models - id: instructions - label: Serve Qwen3-235B + label: Instructions content: | # Step 1. Set up Docker permissions @@ -92,7 +103,7 @@ spec: export HF_TOKEN="your_huggingface_token" # Model to serve - export MODEL_HANDLE="nvidia/Qwen3-235B-A22B-NVFP4" + export MODEL_HANDLE="" # Maximum context length export MAX_MODEL_LEN=8192 @@ -106,9 +117,16 @@ spec: docker pull nvcr.io/nvidia/vllm:26.01-py3 ``` + For Step-3.7-Flash models, pull the custom VLLM container + ```bash + docker pull vllm/vllm-openai:stepfun37 + ``` + # Step 4. Start vLLM server - Start the vLLM server with the Qwen3-235B model. This model fits entirely in VRAM on the GB300. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`. + Start the vLLM server with the model. 
On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`. + + For Qwen3-235B NVFP4 model, run with the NGC container. This model fits entirely in VRAM on the GB300. ```bash docker run -d \ @@ -126,6 +144,28 @@ spec: --gpu-memory-utilization 0.9 ``` + For Step-3.7-Flash models, run with the custom VLLM container. The FP8 and the NVFP4 versions fit entirely in VRAM on the GB300. + + ```bash + docker run -d \ + --name vllm-server \ + --gpus all \ + --ipc host \ + --ulimit memlock=-1 \ + --ulimit stack=67108864 \ + -p 8000:8000 \ + -e HF_TOKEN="$HF_TOKEN" \ + -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \ + vllm/vllm-openai:stepfun37 \ + "$MODEL_HANDLE" \ + --gpu-memory-utilization 0.95 \ + --trust-remote-code \ + --reasoning-parser step3p5 \ + --enable-auto-tool-choice \ + --tool-call-parser step3p5 \ + --kv-cache-dtype fp8 + ``` + Check the server logs for startup progress: ```bash @@ -135,7 +175,7 @@ spec: Expected output includes: - Model download progress (first run only) - Model loading into GPU memory - - `Uvicorn running on http://0.0.0.0:8000` + - `Application startup complete.` Press `Ctrl+C` to exit log view once the server is ready. @@ -166,9 +206,10 @@ spec: Optionally, remove the image and cached model: + Eg. 
```bash - docker rmi nvcr.io/nvidia/vllm:26.01-py3 - rm -rf $HOME/.cache/huggingface/hub/models--nvidia--Qwen3-235B-A22B-NVFP4 + docker rmi "" + rm -rf $HOME/.cache/huggingface/hub/"" ``` diff --git a/nvidia/station-vllm/endpoint-test.yaml b/nvidia/station-vllm/endpoint-test.yaml index 42a2056..c003724 100644 --- a/nvidia/station-vllm/endpoint-test.yaml +++ b/nvidia/station-vllm/endpoint-test.yaml @@ -1,8 +1,8 @@ kind: Playbook metadata: name: station-vllm - displayName: vLLM for Inference - shortDescription: Install and use vLLM on DGX Station + displayName: Serve Qwen3-235B with vLLM + shortDescription: Set up vLLM server with Qwen3-235B on DGX Station publisher: nvidia description: | # REPLACE THIS WITH YOUR MODEL CARD @@ -15,7 +15,7 @@ metadata: attributes: - key: DURATION - value: 30 MIN + value: 20 MIN spec: artifactName: station-vllm @@ -42,9 +42,7 @@ spec: # What you'll accomplish - Serve a **supported model** using vLLM on NVIDIA DGX Station. Refer to the table below to see the supported models. - - You'll set up vLLM high-throughput LLM serving on NVIDIA DGX Station with Blackwell architecture. + Serve the **Qwen3-235B-A22B-NVFP4** model using vLLM on NVIDIA DGX Station. This 235B parameter model uses NVFP4 quantization and fits entirely in VRAM on the GB300 GPU. # What to know before starting @@ -59,30 +57,21 @@ spec: - HuggingFace account with access token - Network access to NGC and HuggingFace - # Model Support Matrix - - The following models are supported with vLLM on Spark. 
All listed models are available and ready to use: - - | Model | Quantization | Support Status | HF Handle | - |-------|-------------|----------------|-----------| - | **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) | - | **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) | - | **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) | # Time & risk - * **Duration:** 30 minutes (longer on first run due to model download) + * **Duration:** 15-20 minutes (longer on first run due to model download) * **Risks:** Model download requires HuggingFace authentication * **Rollback:** Stop and remove the container to restore state - * **Last Updated:** 05/28/2026 - * Update models + * **Last Updated:** 03/02/2026 + * First Publication - id: instructions - label: Instructions + label: Serve Qwen3-235B content: | # Step 1. Set up Docker permissions @@ -103,7 +92,7 @@ spec: export HF_TOKEN="your_huggingface_token" # Model to serve - export MODEL_HANDLE="" + export MODEL_HANDLE="nvidia/Qwen3-235B-A22B-NVFP4" # Maximum context length export MAX_MODEL_LEN=8192 @@ -117,16 +106,9 @@ spec: docker pull nvcr.io/nvidia/vllm:26.01-py3 ``` - For Step-3.7-Flash models, pull the custom VLLM container - ```bash - docker pull vllm/vllm-openai:stepfun37 - ``` - # Step 4. Start vLLM server - Start the vLLM server with the model. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`. - - For Qwen3-235B NVFP4 model, run with the NGC container. This model fits entirely in VRAM on the GB300. + Start the vLLM server with the Qwen3-235B model. This model fits entirely in VRAM on the GB300. 
On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`. ```bash docker run -d \ @@ -144,28 +126,6 @@ spec: --gpu-memory-utilization 0.9 ``` - For Step-3.7-Flash models, run with the custom VLLM container. The FP8 and the NVFP4 versions fit entirely in VRAM on the GB300. - - ```bash - docker run -d \ - --name vllm-server \ - --gpus all \ - --ipc host \ - --ulimit memlock=-1 \ - --ulimit stack=67108864 \ - -p 8000:8000 \ - -e HF_TOKEN="$HF_TOKEN" \ - -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \ - vllm/vllm-openai:stepfun37 \ - "$MODEL_HANDLE" \ - --gpu-memory-utilization 0.95 \ - --trust-remote-code \ - --reasoning-parser step3p5 \ - --enable-auto-tool-choice \ - --tool-call-parser step3p5 \ - --kv-cache-dtype fp8 - ``` - Check the server logs for startup progress: ```bash @@ -175,7 +135,7 @@ spec: Expected output includes: - Model download progress (first run only) - Model loading into GPU memory - - `Application startup complete.` + - `Uvicorn running on http://0.0.0.0:8000` Press `Ctrl+C` to exit log view once the server is ready. @@ -206,10 +166,9 @@ spec: Optionally, remove the image and cached model: - Eg. ```bash - docker rmi "" - rm -rf $HOME/.cache/huggingface/hub/"" + docker rmi nvcr.io/nvidia/vllm:26.01-py3 + rm -rf $HOME/.cache/huggingface/hub/models--nvidia--Qwen3-235B-A22B-NVFP4 ```
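The cached-model path in the cleanup step follows the Hugging Face hub cache layout, where the `/` in a model handle becomes `--` after a `models--` prefix. The following is a minimal sketch for deriving that directory from `MODEL_HANDLE`, assuming the variable is exported as in Step 2. `hf_cache_dir_name` is a hypothetical helper and is not part of the playbook.

```shell
# Hypothetical helper (not part of the playbook): map a Hugging Face model
# handle such as "org/name" to its hub cache directory name "models--org--name".
hf_cache_dir_name() {
  printf 'models--%s\n' "$(printf '%s' "$1" | sed 's|/|--|g')"
}

# Fall back to the handle used in this playbook if MODEL_HANDLE is not exported.
MODEL_HANDLE="${MODEL_HANDLE:-nvidia/Qwen3-235B-A22B-NVFP4}"

# Print the directory that the cleanup step would remove; nothing is deleted here.
echo "$HOME/.cache/huggingface/hub/$(hf_cache_dir_name "$MODEL_HANDLE")"
```

Printing the path first lets you inspect it before passing it to `rm -rf`, which is useful when `MODEL_HANDLE` has been changed from the default.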