dgx-spark-playbooks/nvidia/station-nvfp4-pretraining/endpoint-test.yaml
2026-05-29 00:08:55 +00:00

kind: Playbook
metadata:
name: station-nvfp4-pretraining
displayName: NVFP4 Pretraining with Megatron Bridge
shortDescription: Pretrain Llama 3.1 8B with NVFP4 mixed precision on DGX Station using Megatron Bridge
publisher: nvidia
description: |
# REPLACE THIS WITH YOUR MODEL CARD
https://gitlab-master.nvidia.com/api-catalog/examples/-/blob/main/modelcard-example-mixtral8x7b.md?ref_type=heads
labelsV2:
- gpuType:playbook:gpu_type_station
- Training
- Megatron Bridge
- NVFP4
attributes:
- key: DURATION
value: 30 MIN
spec:
artifactName: station-nvfp4-pretraining
nvcfFunctionId: None
attributes:
showUnavailableBanner: false
apiDocsUrl: None
termsOfUse: |
tabs:
-
id: overview
label: Overview
content: |
# NVFP4 training
NVFP4 is a 4-bit floating-point format natively supported by NVIDIA Blackwell Tensor Cores.
When applied during **pretraining**, NVFP4 reduces memory bandwidth and compute cost for matrix multiplications while preserving model quality through mixed-precision accumulation in higher precision (BF16/FP32).
Megatron-Bridge is NVIDIA's library for large-scale distributed training built on top of Megatron-Core.
It provides composable recipe configs for models, optimizers, and mixed-precision strategies — including the first-class `bf16_with_nvfp4_mixed` recipe used in this playbook.
Combining the two lets you pretrain LLMs at lower memory cost and higher throughput compared to BF16-only training, with minimal accuracy trade-off.
Key benefits:
- **Up to ~2× higher training throughput vs BF16** - higher TFLOP/s with minimal loss in model quality (this playbook measures 1.68× on a single GB300)
- **Native Blackwell NVFP4 GEMMs** — FP4 matmuls run as a single Tensor Core instruction, no software emulation overhead
- **Recipe-based configuration** — swap between `bf16_mixed`, `bf16_with_fp8_current_scaling_mixed`, and `bf16_with_nvfp4_mixed` with a single line
- **Stability controls** — pin the first/last N transformer layers in BF16 (this playbook keeps the last 4 layers in BF16 via `first_last_layers_bf16`)
- **~2× memory reduction** - For inference weight storage vs FP8, ~3.5× vs FP16
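
The weight-storage ratios in the last bullet can be reproduced with rough arithmetic. The sketch below assumes NVFP4 stores 4-bit values with one 8-bit (E4M3) scale factor per 16-element block, ignores the small per-tensor scale, and uses an approximate parameter count for Llama 3.1 8B; none of these constants are stated in this playbook.

```python
# Rough weight-storage arithmetic. All constants are illustrative assumptions.
PARAMS = 8.03e9   # approximate Llama 3.1 8B parameter count
BLOCK_SIZE = 16   # assumed number of NVFP4 elements sharing one scale factor
SCALE_BITS = 8    # assumed one FP8 (E4M3) scale per block


def bits_per_weight(value_bits: int, block_size: int = 0, scale_bits: int = 0) -> float:
    """Effective bits per weight, including per-block scale overhead if any."""
    overhead = scale_bits / block_size if block_size else 0.0
    return value_bits + overhead


def weight_gb(params: float, bits: float) -> float:
    """Weight storage in GB (10^9 bytes)."""
    return params * bits / 8 / 1e9


nvfp4_bits = bits_per_weight(4, BLOCK_SIZE, SCALE_BITS)  # 4 + 8/16 = 4.5
fp8_bits = bits_per_weight(8)
fp16_bits = bits_per_weight(16)

for name, bits in [("NVFP4", nvfp4_bits), ("FP8", fp8_bits), ("FP16", fp16_bits)]:
    print(f"{name:6s} {bits:4.1f} bits/weight  {weight_gb(PARAMS, bits):6.2f} GB")
print(f"FP8 / NVFP4:  {fp8_bits / nvfp4_bits:.2f}x")   # 1.78x
print(f"FP16 / NVFP4: {fp16_bits / nvfp4_bits:.2f}x")  # 3.56x
```

Under these assumptions the scale-factor overhead is why the ratio against FP8 lands slightly below 2× and the ratio against FP16 lands near 3.5× rather than 4×.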
# What you'll accomplish
Pretrain a **Llama 3.1 8B** model using Megatron-Bridge with NVFP4 mixed precision on NVIDIA DGX Station.
You'll run a short training loop with mock data to verify the full pipeline end-to-end, compare it against a plain BF16 baseline via the `--disable-fp4` flag, and then learn how to point the script at real data if required.
# Measured results
Run settings:
- Model: Llama 3.1 8B (`llama3_8b_pretrain_config()`)
- 50 iterations, 2 warmup
- Global batch size 64, micro batch size 4, sequence length 4096
- Dummy data (Megatron-Core's built-in `MockGPTDataset` — synthetic random token IDs, no real corpus)
- Single GB300 GPU, `nvcr.io/nvidia/nemo:26.04` container
- Latency: average of iterations 20-50 (iteration 10 includes one-time CUDA-graph/compile overhead)
- VRAM: peak of `nvidia-smi --query-compute-apps=used_memory` sampled every 2 s during the run
| Precision | Recipe | Avg step time | Throughput (Model TFLOP/s/GPU) | Peak VRAM |
|---|---|---|---|---|
| BF16 baseline | `bf16_mixed()` | 9.05 s | ~1399 | 221.6 GB |
| NVFP4 (last-4 BF16) | `bf16_with_nvfp4_mixed()` + `first_last_layers_bf16=True`, `num_layers_at_end_in_bf16=4` | **5.39 s** | **~2347** | **207.8 GB** |
NVFP4 is **1.68× faster** than BF16 (≈68% higher throughput) with ≈13.8 GB (≈6%) less peak VRAM — the regime NVFP4 was designed for, where matmul FLOPs dominate each step and quantization overhead is amortized over wide linear projections.
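The headline ratios follow directly from the table. As a quick check, the snippet below recomputes them from the four measured values; the dictionary layout is only for illustration.

```python
# Values copied from the measured-results table above.
bf16 = {"step_s": 9.05, "tflops": 1399.0, "vram_gb": 221.6}
nvfp4 = {"step_s": 5.39, "tflops": 2347.0, "vram_gb": 207.8}

speedup = bf16["step_s"] / nvfp4["step_s"]
throughput_gain = nvfp4["tflops"] / bf16["tflops"] - 1.0
vram_saved_gb = bf16["vram_gb"] - nvfp4["vram_gb"]
vram_saved_frac = vram_saved_gb / bf16["vram_gb"]

print(f"speedup:         {speedup:.2f}x")            # 1.68x
print(f"throughput gain: {throughput_gain:.0%}")     # 68%
print(f"VRAM saved:      {vram_saved_gb:.1f} GB ({vram_saved_frac:.1%})")  # 13.8 GB (6.2%)
```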
# What to know before starting
- Basic Python and PyTorch usage
- Familiarity with distributed training concepts (`torchrun`)
- Understanding of mixed precision training (FP16/BF16/FP8)
# Prerequisites
- NVIDIA DGX Station with Blackwell architecture GPU (GB300 chip)
- Docker installed with GPU support
- NVIDIA Container Toolkit configured
- Megatron-Bridge installed (via the NeMo Framework NGC container)
Verify your setup:
```bash
# Check GPU availability and architecture
nvidia-smi
# Verify Python and torch
python3 -c "import torch; print(torch.cuda.get_device_name(0))"
```
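For a more explicit architecture check, a small helper along these lines may be useful. It is a sketch only: the rule that compute capability 10.0 or newer indicates a Blackwell-class GPU is an assumption made here, not something this playbook states, and the helper names are hypothetical. `torch.cuda.get_device_capability` and `torch.cuda.get_device_name` are standard PyTorch calls.

```python
def looks_like_blackwell(major: int, minor: int) -> bool:
    """Assumption: compute capability 10.0 or newer means Blackwell-class."""
    return (major, minor) >= (10, 0)


def check_current_gpu(device: int = 0) -> bool:
    """Print the device name and capability, and return the heuristic verdict."""
    import torch  # imported lazily so the helper above has no dependencies

    if not torch.cuda.is_available():
        print("No CUDA device visible")
        return False
    major, minor = torch.cuda.get_device_capability(device)
    name = torch.cuda.get_device_name(device)
    ok = looks_like_blackwell(major, minor)
    verdict = "OK for NVFP4 (heuristic)" if ok else "NVFP4 likely unsupported"
    print(f"{name}: compute capability {major}.{minor} -> {verdict}")
    return ok


# Usage, inside the container: check_current_gpu()
```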
# Time & risk
* **Estimated duration**: 20-30 minutes (quick test loop with default `--train-iters 50`); longer for real data
* **Risks**:
* NVFP4 requires Blackwell GPUs — will fail on Hopper or older
* Mock data is used by default (`eval_iters=0`); real data requires a preprocessed Megatron-format dataset
* **Rollback**: Stop the `torchrun` process and remove any checkpoint directories
* **Last Updated:** 05/26/2026
* First Publication
-
id: instructions
label: Pretrain with NVFP4
content: |
# Step 1. Set up the environment
The recommended way to run Megatron-Bridge on DGX Station is through the [NeMo Framework container](https://github.com/NVIDIA-NeMo/Megatron-Bridge#-nemo-framework-container), which includes Megatron-Bridge, Megatron-Core, Transformer Engine, and all CUDA dependencies pre-installed. Running outside the container is not supported in this playbook — the NVFP4 kernels rely on the exact Transformer Engine / CUDA versions shipped inside the image.
```bash
git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-spark-playbooks/nvidia/station-nvfp4-pretraining/assets
# Use the latest nemo tag
export TAG=26.04
docker run --rm -it \
--gpus all \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
-v "$(pwd):/workdir" \
-w /workdir \
--entrypoint bash \
nvcr.io/nvidia/nemo:${TAG}
```
All subsequent `torchrun` / `python` commands in this playbook are meant to be executed **from the shell inside this container**.
# Step 2. Review the pretraining script
The pretraining script can be found at `pretrain_llama.py`. The key piece is the NVFP4 precision config, built on top of Megatron-Bridge's prebuilt `bf16_with_nvfp4_mixed` recipe:
```python
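from megatron.bridge.training.mixed_precision import bf16_with_nvfp4_mixed


def nvfp4_mixed_precision():
    # Start from the prebuilt NVFP4 recipe, then pin layers to BF16 for stability.
    cfg = bf16_with_nvfp4_mixed()
    cfg.first_last_layers_bf16 = True
    cfg.num_layers_at_start_in_bf16 = 0  # no pinning at the start of the network
    cfg.num_layers_at_end_in_bf16 = 4    # keep the last 4 transformer layers in BF16
    return cfg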
```
`bf16_with_nvfp4_mixed()` already sets `fp8="e4m3"` and `fp8_recipe="nvfp4"` under the hood; we just toggle the layer-pinning knobs on top:
- **Last 4 layers in BF16** (`num_layers_at_end_in_bf16=4`) for training stability (adjustable per model)
- **No start-layer pinning** (`num_layers_at_start_in_bf16=0`) — last-layer stability is usually enough
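
To make the layer-pinning knobs concrete, the sketch below lists which layers would stay in BF16 for a 32-layer model such as Llama 3.1 8B. It is a plain-Python illustration, not library code: the helper name is hypothetical, and 0-based layer indexing is an assumption.

```python
def bf16_layer_indices(num_layers: int, num_at_start: int, num_at_end: int) -> list[int]:
    """Return 0-based indices of layers kept in BF16 (illustrative helper)."""
    if num_at_start < 0 or num_at_end < 0:
        raise ValueError("layer counts must be non-negative")
    if num_at_start + num_at_end > num_layers:
        raise ValueError("more pinned layers than the model has")
    head = range(num_at_start)
    tail = range(num_layers - num_at_end, num_layers)
    return [*head, *tail]


NUM_LAYERS = 32  # Llama 3.1 8B has 32 transformer layers

pinned = bf16_layer_indices(NUM_LAYERS, num_at_start=0, num_at_end=4)
print(pinned)                             # [28, 29, 30, 31]
print(f"{NUM_LAYERS - len(pinned)} layers run NVFP4 GEMMs")  # 28 layers run NVFP4 GEMMs
```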
> [!NOTE]
> The script uses `llama3_8b_pretrain_config()` which defaults to `context_parallel_size=2`. The script overrides this to `context_parallel_size=1` for single-GPU runs. If you swap in a larger recipe (e.g. `nemotron_3_nano_pretrain_config`, which defaults to TP=4), you **must** either launch `torchrun --nproc_per_node=4` on a 4-GPU node or override `config.model.tensor_model_parallel_size = 1` before calling `pretrain(...)`, or you will hit:
> `AssertionError: world size (1) is not divisible by total_model_size (...tensor_model_parallel_size=4 * ...)`.
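
The divisibility rule behind that assertion can be checked before launching. The sketch below assumes the total model-parallel size is the product of the tensor, pipeline, and context parallel sizes, which matches the shape of the error message but is not spelled out in this playbook; the function names are hypothetical.

```python
def model_parallel_size(tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    """Assumed total model-parallel size: tensor x pipeline x context."""
    return tp * pp * cp


def data_parallel_size(world_size: int, tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    """Return the implied data-parallel size, or raise if the layout cannot fit."""
    total = model_parallel_size(tp, pp, cp)
    if world_size % total != 0:
        raise ValueError(
            f"world size ({world_size}) is not divisible by model-parallel size ({total})"
        )
    return world_size // total


print(data_parallel_size(1, cp=1))  # 1: single GPU with context_parallel_size=1
print(data_parallel_size(4, tp=4))  # 1: TP=4 on a 4-GPU node
# data_parallel_size(1, tp=4) raises ValueError: TP=4 cannot fit on one GPU
```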
# Step 3. Launch NVFP4 pre-training
Launch a short training run with mock data and redirect the output to a log file so you can inspect VRAM and per-iteration latency afterwards:
```bash
torchrun --nproc_per_node=1 pretrain_llama.py > nvfp4.log 2>&1
```
Expected output (see `nvfp4.log`):
- Model initialization logs and a `Theoretical memory footprints: weight and optimizer=...` line
- Iteration progress printed every step (`log_interval=1`), e.g. `iteration 10/50 | ... elapsed time per iteration (ms): ... | lm loss: ...`
- A `[Rank 0] ... memory (GB) | mem-max-reserved-gigabytes: ...` line: PyTorch's peak reserved memory, which approximates peak VRAM
- A checkpoint saved to `/workdir/nemo_experiments/default/checkpoints`
If the run exits with status 0 (run `echo $?` right after `torchrun` returns) and `nvfp4.log` contains no traceback, your NVFP4 pretraining setup is working.
# Step 4. Compare with BF16 baseline
Run the same script with `--disable-fp4` to establish a BF16 baseline, again logging to a file:
```bash
# Remove the prior checkpoint directory so the two runs don't interfere
rm -rf nemo_experiments
torchrun --nproc_per_node=1 pretrain_llama.py --disable-fp4 > bf16.log 2>&1
```
To compare the two runs on **latency** and **throughput**, grep the per-iteration lines out of each log:
```bash
grep -E "elapsed time per iteration|MODEL_TFLOP" nvfp4.log
grep -E "elapsed time per iteration|MODEL_TFLOP" bf16.log
```
Each step prints two lines:
- `Step Time : 5.39s GPU utilization: 2347.0MODEL_TFLOP/s/GPU` — step latency and throughput
- `iteration 10/50 | ... elapsed time per iteration (ms): 5390 | ... lm loss: ...` — same latency in ms plus loss
Iteration 10 includes one-time CUDA-graph/compile overhead, so average iterations 20-50 for a fair per-step latency number.
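Averaging by hand is tedious, so a small parser can do it. The sketch below assumes log lines shaped like the `iteration N/M | ... elapsed time per iteration (ms): X | ...` example above; the sample lines are made up for illustration and are not real output, and the default window of iterations 20 to 50 is a parameter you can change.

```python
import re
from statistics import mean

# Matches the iteration number and the per-iteration time in milliseconds.
ITER_RE = re.compile(
    r"iteration\s+(\d+)\s*/\s*\d+\s*\|.*?elapsed time per iteration \(ms\):\s*([\d.]+)"
)


def step_times_ms(lines):
    """Map iteration number -> elapsed time per iteration (ms)."""
    times = {}
    for line in lines:
        match = ITER_RE.search(line)
        if match:
            times[int(match.group(1))] = float(match.group(2))
    return times


def average_step_ms(times, first=20, last=50):
    """Average step time over iterations first..last (inclusive)."""
    window = [ms for it, ms in times.items() if first <= it <= last]
    if not window:
        raise ValueError("no iterations found in the requested window")
    return mean(window)


# Made-up sample lines, for illustration only.
sample = [
    " iteration       10/      50 | elapsed time per iteration (ms): 9000.0 | lm loss: 1.0E+01 |",
    " iteration       20/      50 | elapsed time per iteration (ms): 5400.0 | lm loss: 9.0E+00 |",
    " iteration       21/      50 | elapsed time per iteration (ms): 5380.0 | lm loss: 8.9E+00 |",
]
print(average_step_ms(step_times_ms(sample)))  # 5390.0

# On a real run: average_step_ms(step_times_ms(open("nvfp4.log")))
```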
### Measuring peak VRAM (from `nvidia-smi`)
Megatron's in-log memory numbers (`mem-max-reserved-gigabytes`) reflect PyTorch's caching-allocator reservation, which can drift from what the device actually holds. For an accurate read, watch `nvidia-smi` live from a second shell while training runs:
```bash
watch -n 1 nvidia-smi
```
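To record a peak rather than watch the screen, the same query can be polled from a script. This is a minimal sketch: it sums `used_memory` over all compute processes that `nvidia-smi` reports, which is a simplification that suits a single-GPU run with one training job, and the function names are hypothetical. `nvidia-smi` reports MiB, so convert before comparing with GB figures.

```python
import subprocess
import time

QUERY = [
    "nvidia-smi",
    "--query-compute-apps=used_memory",
    "--format=csv,noheader,nounits",
]


def parse_used_mib(output: str) -> int:
    """Sum per-process used memory (MiB) from nvidia-smi CSV output."""
    total = 0
    for line in output.splitlines():
        line = line.strip()
        if line.isdigit():  # skips blank lines and non-numeric entries
            total += int(line)
    return total


def sample_peak_mib(duration_s: float, interval_s: float = 2.0) -> int:
    """Poll nvidia-smi for duration_s seconds and return the peak total in MiB."""
    peak = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        peak = max(peak, parse_used_mib(result.stdout))
        time.sleep(interval_s)
    return peak


def mib_to_gb(mib: int) -> float:
    """Convert MiB (2^20 bytes) to GB (10^9 bytes)."""
    return mib * 1024 * 1024 / 1e9


# Usage from a second shell while training runs, e.g. for 10 minutes:
#   print(mib_to_gb(sample_peak_mib(600)))
```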
See the measured results in the Overview tab (`overview.md`) for expected VRAM and latency on 1× GB300 with Llama 3.1 8B.
# Step 5. Script arguments
`pretrain_llama.py` accepts the following arguments:
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--disable-fp4` | flag | off | Disable NVFP4; use plain BF16 mixed precision as a baseline |
| `--train-iters` | int | 50 | Number of training iterations |
| `--warmup-iters` | int | 2 | Number of warmup iterations |
| `--global-batch-size` | int | 64 | Global batch size |
| `--micro-batch-size` | int | 4 | Micro batch size (drives peak VRAM; increase to use more memory) |
| `--seq-length` | int | 4096 | Sequence length |
Example combining several flags:
```bash
torchrun --nproc_per_node=1 pretrain_llama.py \
--train-iters 50 --warmup-iters 2 \
--global-batch-size 64 --micro-batch-size 4 --seq-length 4096
```
# Step 6. Point to real data
To train on your own dataset, modify the config in the script:
```python
config = llama3_8b_pretrain_config()
config.data.data_path = "/path/to/your/preprocessed/dataset"
config.train.train_iters = 5000
config.train.global_batch_size = 256
config.train.micro_batch_size = 2
```
Megatron-Bridge expects preprocessed data in Megatron format. See the [Megatron-Bridge data preparation guide](https://docs.nvidia.com/nemo/megatron-bridge/latest/) for details.
# Step 7. Cleanup
Remove checkpoints and log files generated by the runs:
```bash
rm -rf nemo_experiments/ nvfp4.log bf16.log
```
Then exit the container shell (`exit`) — the `--rm` flag in Step 1 deletes it automatically.
# References
- Quickstart: https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/tutorials/recipes/llama/00_quickstart_pretrain.py
- Mixed precision: https://docs.nvidia.com/nemo/megatron-bridge/latest/training/mixed-precision.html
- API: https://docs.nvidia.com/nemo/megatron-bridge/latest/apidocs/bridge/bridge.training.mixed_precision.html
-
id: troubleshooting
label: Troubleshooting
content: |
| Symptom | Cause | Fix |
|---------|-------|-----|
| `RuntimeError: NVFP4 is not supported on this GPU` or similar FP4 error | GPU is not Blackwell architecture | NVFP4 requires Blackwell GPUs (GB200, GB300). Check with `nvidia-smi` |
| `ModuleNotFoundError: No module named 'megatron.bridge'` | Megatron Bridge not installed | Use the NGC container from Step 1 (recommended), or run `pip install megatron-bridge` |
| `CUDA out of memory` during model init | Insufficient GPU memory for Llama 3.1 8B + optimizer states | Reduce `--micro-batch-size`, or spread the model across more GPUs by raising `--nproc_per_node` together with `tensor_model_parallel_size` |
| `torchrun` hangs or times out | NCCL communication failure between GPUs | Rerun with `NCCL_DEBUG=INFO torchrun ...` for details; verify all GPUs are visible |
| Training loss is NaN | Precision instability | Increase `num_layers_at_end_in_bf16` (e.g., from 4 to 8) or reduce learning rate |
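| `--disable-fp4` works but NVFP4 crashes | Transformer Engine version mismatch | Use the Transformer Engine shipped in the `nvcr.io/nvidia/nemo` container from Step 1; outside the container, ensure Transformer Engine supports NVFP4 and update with `pip install --upgrade transformer-engine` |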
| Slow training throughput | Not using Tensor Cores efficiently | Ensure batch dimensions are multiples of 8; check that `nvidia-smi` shows high GPU utilization |
| Permission denied on Docker | User not in docker group | Run `sudo usermod -aG docker $USER && newgrp docker` |
resources:
- name: Megatron Bridge Documentation
url: https://docs.nvidia.com/nemo/megatron-bridge/latest/
- name: Mixed Precision Training Guide
url: https://docs.nvidia.com/nemo/megatron-bridge/latest/training/mixed-precision.html
- name: Megatron Bridge GitHub
url: https://github.com/NVIDIA-NeMo/Megatron-Bridge